It's All About The Goooooooooooooooooooooooooooooooooal!!!

Citations are represented with prefix şş followed by reference link.

Authors: Chung, Nguyen, Pillay and Wang
Southern Methodist University

A. Business Understanding

The Kaggle dataset (https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset), specifically the file players_20.csv, is used in this lab. The dataset provides detailed statistics for the soccer players at clubs in the major soccer leagues of the world. The data originates from the FIFA soccer video game created by EA Sports, which estimates the abilities of the actual players and builds the game on those estimates. The Kaggle dataset was scraped from www.sofifa.com, where the game data is collected. The data we used was last updated on Sept 19th, 2019.

The dataset is important because the player abilities are estimated from the actual players. FIFA is a very popular game, and EA Sports is one of the largest sports video game developers. The estimates of player abilities are quite accurate. Therefore, useful knowledge can be mined from the data to analyze a variety of problems in the soccer industry: for example, wage analysis, player analysis, training strategy, budget analysis, and sports gambling strategy can all be performed using this dataset. Soccer is a large industry; according to Business Wire, its market size was estimated to be worth $488 billion in 2018.

Some of the analyses we are interested in performing are:

  1. Run a predictive model to predict wage from players' abilities. We will select some features to run a regression model, and we will also try running PCA followed by a regression model. To measure effectiveness, we will use RMSE, MAE and R squared.

  2. Run a classification model to classify a player's position from the player's abilities. To validate our model, we will use accuracy, precision, F1 score and ROC.

We also intend to use the detailed game statistics from fbref.com. We can potentially run analyses using the win/loss results from the real-time data.
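As a minimal sketch of the regression metrics named in item 1, the three measures can be computed directly with NumPy. The helper name regression_metrics and the wage values below are hypothetical illustrations, not taken from the dataset or from the models built later.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))   # penalises large errors more
    mae = float(np.mean(np.abs(residuals)))          # average absolute error
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                       # share of variance explained
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical wages in euros (made-up values, not from players_20.csv)
actual = [1000, 3000, 8000, 20000]
predicted = [1500, 2500, 9000, 18000]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because wages are heavily right-skewed, RMSE is likely to be dominated by the few highest earners, which is one reason to report MAE alongside it.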

şş Image captured from https://en.wikipedia.org/wiki/FIFA_20
image.png

B. Data Meaning Type

In [1]:
#All Python module imports
#https://pandas.pydata.org/docs/user_guide/index.html#user-guide
import pandas as pd #Pandas Dataframe module

import numpy as np
from math import pi
#scikit learn 
#https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
import sklearn as sl 


import pycountry
import plotly.express as px

#https://seaborn.pydata.org
import seaborn as sns
import matplotlib.pyplot as plt

# os calls
import os

#Module for formating table for documentation 
#https://pypi.org/project/tabulate/
from tabulate import tabulate

from IPython.display import display, Markdown

#Interactive mode
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import Image

Using our custom metadata file

Due to the large number of features, we have created a CSV file that serves as a metadata file, as follows:
1) File location: ../data/fifa20/data_desc.csv
2) Name: The exact name of the column in our original dataset file; we use this to load the file.
3) Description: The team has filled in a description of each feature.
4) Variable_cat: We use this to categorise related features; for example, 'Position' marks all the features related to the various position attributes.
Using a metadata file makes it easy to ignore attributes that are not relevant to our analysis by marking them as Ignore.

In [2]:
#Read data explanation csv file maintained by project team
%time df_desc = pd.read_csv('../data/fifa20/data_desc.csv',usecols=['Name','Description','Statistics','Variable_cat'])
display(Markdown(df_desc[['Name','Description','Variable_cat']].to_markdown()))

#All positional attributes. These are the various field positions a player can play in, with the player's statistics in
#that position. These attributes are only for non-goalkeepers.
df_position_attr = df_desc[df_desc['Variable_cat']=='Position']['Name']

#All goalKeeper only attributes
df_gk_attr = df_desc[df_desc['Variable_cat']=='GK']['Name']

#All Technical attributes. Non GK players primary technical skills such as dribbling, curving etc.
df_technical_attr = df_desc[df_desc['Variable_cat']=='Technical']['Name']

#All mental attributes for all players
df_mental_attr = df_desc[df_desc['Variable_cat']=='Mental']['Name']

#All Attacking attributes  for all players 
df_attacking_attr = df_desc[df_desc['Variable_cat']=='Attacking']['Name']

#All general Skill category attributes  for all players
df_skill_attr = df_desc[df_desc['Variable_cat']=='Skill']['Name']

#All Movement attributes  for all players
df_movement_attr = df_desc[df_desc['Variable_cat']=='Movement']['Name']

#All Power attributes  for all players
df_power_attr = df_desc[df_desc['Variable_cat']=='Power']['Name']

#All Defending attributes  for all players
df_defending_attr = df_desc[df_desc['Variable_cat']=='Defending']['Name']

# Sample code for future use to include ad hoc columns
# Get all columns of one category (here GK) plus extra columns; the same works for Position, Technical, etc.
## l = pd.concat([df_gk_attr, pd.Series(['short_name','age','club'])])
## df_gkplus = df_players[l]
CPU times: user 2.91 ms, sys: 1.48 ms, total: 4.4 ms
Wall time: 4.07 ms
Name Description Variable_cat
0 ls Left Striker: Responsible to score goals on left side Position
1 st Striker: Responsible to score most goals from center Position
2 rs Right Striker: Responsible to score goals on right side Position
3 lw Left Wing: The left winger will play just ahead of the midfield position and wider than the forward. They're usually fast, can take on players, and have great crossing ability. Position
4 lf Left Forward: A left forward will start in front of the left winger, or indeed, instead of a left winger if a team is playing attacking enough. These will loiter on their side of the opponent's area and are often full of spark and style. This is the kind of position Ronaldo and Messi take on, which allows them to easily switch sides when needed. Being able to cut inside into the centre of the pitch, take the ball to the byline, deliver a pinpoint cross or simply score goals are common traits of these players. Position
5 cf Center Forward: This is the attacker who lurks round the opponent's box, fighting for any scrap they can find. Centre forwards are often physically strong and efficient at scoring goals, or extremely creative in the opponent's area. Will have hardly any defensive duty, aside from jogging back to the half-way line when the other team has a corner, or by adding their presence to the defence from set-pieces in the final few minutes. Position
6 rf Right Forward: same as LF on right. Position
7 rw Right Wing: Same as lw on right Position
8 lam Left Attach Midfield: Position
9 cam Center Attacker Midfield: The most attacking variant of the midfield package, these players often sit behind the strikers and thread through balls in their direction. An attacking midfielder needs to have finishing ability, as they're the team's second most important source of goals. It's common to see players in this position with little tackling ability, as they get back to defend much less than their team-mates who share the same third of the field. Position
10 ram Right Attacking Midfield Position
11 lm Left Midfield: The left midfielder plays in front of the left-back, and is often one of the flair players. Usually one of the quickest in the team, these players are often skilful and can deliver a decent ball into the box. They should aid the defence as often as possible, but should also get forward to create and score goals. Position
12 lcm Left Center Midfield Position
13 cm Center Midfielder: By default, a centre midfielder should be doing the most running on the pitch. Their position holds many responsibilities, as their success can often dictate play for the rest of the team. A world class midfielder can tackle admirably, has great vision, and can get forward to score many goals. In fact, top centre midfielder players in domestic championships such as the Premier League and La Liga often turn in enough goals to make up for misfiring strikers. Position
14 rcm Right Center Midfielder: Position
15 rm Right Midfielder Position
16 lwb Left Wing Back Position
17 ldm Left Defensive Midfield Position
18 cdm Center Defensive Midfielder: Situated just behind the half way line, players in this position are often more mobile than defenders, but aren't natural goalscorers. They're prone to mopping up lose balls and dictating attacking moves by spraying the sphere forward. Adding extra support to the defence when under pressure, and breaking forward to the edge of the other team's box, defensive midfield players provide a vital backbone that is easy to overlook. Position
19 rdm Right Defensive Midfielder: Position
20 rwb Right Wing Back Position
21 lb Left Back Position
22 lcb Left Center Back Position
23 cb Center Back: The behemoth at the heart of defence. Each team is usually made up from two or three centre backs, as they provide the muscle and height to take on tricky attackers. In many ways, a dominant centre back is the perfect leader on the pitch, as they can scream commands to other players from their position that allows them full sight of the whole pitch. Expect them to be hard-tackling, header winning Goliath-types, even if they're not from Eastern Europe. Position
24 rcb Right Center Back: Position
25 rb Right Back: The right-sided defender. See the description of the left back for more details, changing the key word where necessary. Position
26 short_name Players Name Regular
27 age Playes age Regular
28 dob Playes date of birth yyyy-mm-dd Regular
29 height_cm Playes height Regular
30 weight_kg Playes weight Regular
31 nationality Players nationality Regular
32 club Club the players plays for Regular
33 overall Players overall rating Regular
34 potential Players potential Regular
35 value_eur Players value in euros Regular
36 wage_eur Players pay per month in euros Regular
37 player_positions Possible player positions, refer to all positions listed Regular
38 preferred_foot Preferred foot of the player Regular
39 international_reputation Players reputation on a scale of 5 Regular
40 weak_foot Players non dominant foot rating on a scale of 5 Regular
41 body_type Body type of a player. e.g Normal, Lean, Stocky… Regular
42 release_clause_eur Contract date when a player will be released form the club. Regular
43 team_position Players regular position, can have multiple positions. Each position be described below. Regular
44 nation_position Players position where he plays for his nation when not in the club Regular
45 pace Pace is the a physical attribute, the speed of the player Technical
46 shooting Technical attribute, the players shooting power Technical
47 passing Technical attribute. Players passing skill scale of 100 Technical
48 dribbling Technical attribute. Players dribbling skill scale of 100 Technical
49 defending Technical attribute. Players defending skill scale of 100 Technical
50 physic Physical attribute. Scale of 100 Technical
51 gk_diving Goalkeepers diving rating on scale of 100 GK
52 gk_handling Goalkeepers ball handling rating on scale of 100 GK
53 gk_kicking Goalkeepers kicking rating on scale of 100 GK
54 gk_reflexes Goalkeepers reflexes rating on scale of 100 GK
55 gk_speed Goalkeepers speed rating on scale of 100 GK
56 gk_positioning Goalkeepers positioning rating on scale of 100 GK
57 attacking_crossing Technical skill Attacking
58 attacking_finishing Technical skill rating on scale of 100 Attacking
59 attacking_heading_accuracy Technical skill rating on scale of 100 Attacking
60 attacking_short_passing Technical skill rating on scale of 100 Attacking
61 attacking_volleys Technical skill rating on scale of 100 Attacking
62 skill_dribbling Technical skill rating on scale of 100 Skill
63 skill_curve Technical skill rating on scale of 100 Skill
64 skill_fk_accuracy Technical skill rating on scale of 100 Skill
65 skill_long_passing Technical skill rating on scale of 100 Skill
66 skill_ball_control Technical skill rating on scale of 100 Skill
67 movement_acceleration Physical skill rating on scale of 100 Movement
68 movement_sprint_speed Physical skill rating on scale of 100 Movement
69 movement_agility Physical skill rating on scale of 100 Movement
70 movement_reactions Physical skill rating on scale of 100 Movement
71 movement_balance Physical skill rating on scale of 100 Movement
72 power_shot_power Technical skill rating on scale of 100 Power
73 power_jumping Physical skill rating on scale of 100 Power
74 power_stamina Technical skill rating on scale of 100 Power
75 power_strength Physical skill rating on scale of 100 Power
76 power_long_shots Technical skill rating on scale of 100 Power
77 mentality_aggression Mental skill rating on scale of 100 Mental
78 mentality_interceptions Mental skill rating on scale of 100 Mental
79 mentality_positioning Mental skill rating on scale of 100 Mental
80 mentality_vision Mental skill rating on scale of 100 Mental
81 mentality_penalties Mental skill rating on scale of 100 Mental
82 mentality_composure Mental skill rating on scale of 100 Mental
83 defending_marking Technical skill rating on scale of 100 Defending
84 defending_standing_tackle Technical skill rating on scale of 100 Defending
85 defending_sliding_tackle Technical skill rating on scale of 100 Defending
86 goalkeeping_diving rating on scale of 100 Ignore
87 goalkeeping_handling rating on scale of 100 Ignore
88 goalkeeping_kicking rating on scale of 100 Ignore
89 goalkeeping_positioning rating on scale of 100 Ignore
90 goalkeeping_reflexes rating on scale of 100 Ignore
91 sofifa_id Players Id Regular

image.png This image helps us understand the various positions that a soccer player can be assigned during a game. The feature player_positions lists the positions that a player can play as comma-separated values; the very first item in that list is the player's preferred field position. The team_position is the position in which the player is currently used by the team. For our analysis we will use the player's preferred position.

The statistics for each player's ability in the respective positions are stored in separate features on a scale of 1 to 100, as described in the table output above.

A goalkeeper does not, under normal circumstances, occupy any of the positions that a regular player occupies, which means he has no statistics for those features. Instead, a goalkeeper has a different set of features to capture his statistics; these are shown in the table above with the Variable_cat value 'GK'.

In [3]:
#Data mining read csv file 
#Using data set: https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset#players_15.csv
#we will read only the data we are interested in
cols_to_read = df_desc['Name'].loc[df_desc.Variable_cat != 'Ignore']
%time df_players = pd.read_csv('../data/fifa20/players_20.csv',usecols=cols_to_read)
CPU times: user 163 ms, sys: 38.7 ms, total: 201 ms
Wall time: 205 ms

Data shapes and columns

In [4]:
df_players.shape
df_players.describe()
df_players.info(verbose=True, show_counts=True)
Out[4]:
(18278, 87)
Out[4]:
sofifa_id age height_cm weight_kg overall potential value_eur wage_eur international_reputation weak_foot ... power_long_shots mentality_aggression mentality_interceptions mentality_positioning mentality_vision mentality_penalties mentality_composure defending_marking defending_standing_tackle defending_sliding_tackle
count 18278.000000 18278.000000 18278.000000 18278.000000 18278.000000 18278.000000 1.827800e+04 18278.000000 18278.000000 18278.000000 ... 18278.000000 18278.000000 18278.000000 18278.000000 18278.000000 18278.000000 18278.000000 18278.000000 18278.000000 18278.000000
mean 219738.864482 25.283291 181.362184 75.276343 66.244994 71.546887 2.484038e+06 9456.942773 1.103184 2.944250 ... 46.812945 55.742149 46.380239 50.072163 53.609749 48.383357 58.528778 46.848889 47.640333 45.606631
std 27960.200461 4.656964 6.756961 7.047744 6.949953 6.139669 5.585481e+06 21351.714095 0.378861 0.664656 ... 19.322343 17.318157 20.775812 19.594022 13.955626 15.708099 11.880840 20.091287 21.585641 21.217734
min 768.000000 16.000000 156.000000 50.000000 48.000000 49.000000 0.000000e+00 0.000000 1.000000 1.000000 ... 4.000000 9.000000 3.000000 2.000000 9.000000 7.000000 12.000000 1.000000 5.000000 3.000000
25% 204445.500000 22.000000 177.000000 70.000000 62.000000 67.000000 3.250000e+05 1000.000000 1.000000 3.000000 ... 32.000000 44.000000 25.000000 39.000000 44.000000 39.000000 51.000000 29.000000 27.000000 24.000000
50% 226165.000000 25.000000 181.000000 75.000000 66.000000 71.000000 7.000000e+05 3000.000000 1.000000 3.000000 ... 51.000000 58.000000 52.000000 55.000000 55.000000 49.000000 60.000000 52.000000 55.000000 52.000000
75% 240795.750000 29.000000 186.000000 80.000000 71.000000 75.000000 2.100000e+06 8000.000000 1.000000 3.000000 ... 62.000000 69.000000 64.000000 64.000000 64.000000 60.000000 67.000000 64.000000 66.000000 64.000000
max 252905.000000 42.000000 205.000000 110.000000 94.000000 95.000000 1.055000e+08 565000.000000 5.000000 5.000000 ... 94.000000 95.000000 92.000000 95.000000 94.000000 92.000000 96.000000 94.000000 92.000000 90.000000

8 rows × 52 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18278 entries, 0 to 18277
Data columns (total 87 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   sofifa_id                   18278 non-null  int64  
 1   short_name                  18278 non-null  object 
 2   age                         18278 non-null  int64  
 3   dob                         18278 non-null  object 
 4   height_cm                   18278 non-null  int64  
 5   weight_kg                   18278 non-null  int64  
 6   nationality                 18278 non-null  object 
 7   club                        18278 non-null  object 
 8   overall                     18278 non-null  int64  
 9   potential                   18278 non-null  int64  
 10  value_eur                   18278 non-null  int64  
 11  wage_eur                    18278 non-null  int64  
 12  player_positions            18278 non-null  object 
 13  preferred_foot              18278 non-null  object 
 14  international_reputation    18278 non-null  int64  
 15  weak_foot                   18278 non-null  int64  
 16  body_type                   18278 non-null  object 
 17  release_clause_eur          16980 non-null  float64
 18  team_position               18038 non-null  object 
 19  nation_position             1126 non-null   object 
 20  pace                        16242 non-null  float64
 21  shooting                    16242 non-null  float64
 22  passing                     16242 non-null  float64
 23  dribbling                   16242 non-null  float64
 24  defending                   16242 non-null  float64
 25  physic                      16242 non-null  float64
 26  gk_diving                   2036 non-null   float64
 27  gk_handling                 2036 non-null   float64
 28  gk_kicking                  2036 non-null   float64
 29  gk_reflexes                 2036 non-null   float64
 30  gk_speed                    2036 non-null   float64
 31  gk_positioning              2036 non-null   float64
 32  attacking_crossing          18278 non-null  int64  
 33  attacking_finishing         18278 non-null  int64  
 34  attacking_heading_accuracy  18278 non-null  int64  
 35  attacking_short_passing     18278 non-null  int64  
 36  attacking_volleys           18278 non-null  int64  
 37  skill_dribbling             18278 non-null  int64  
 38  skill_curve                 18278 non-null  int64  
 39  skill_fk_accuracy           18278 non-null  int64  
 40  skill_long_passing          18278 non-null  int64  
 41  skill_ball_control          18278 non-null  int64  
 42  movement_acceleration       18278 non-null  int64  
 43  movement_sprint_speed       18278 non-null  int64  
 44  movement_agility            18278 non-null  int64  
 45  movement_reactions          18278 non-null  int64  
 46  movement_balance            18278 non-null  int64  
 47  power_shot_power            18278 non-null  int64  
 48  power_jumping               18278 non-null  int64  
 49  power_stamina               18278 non-null  int64  
 50  power_strength              18278 non-null  int64  
 51  power_long_shots            18278 non-null  int64  
 52  mentality_aggression        18278 non-null  int64  
 53  mentality_interceptions     18278 non-null  int64  
 54  mentality_positioning       18278 non-null  int64  
 55  mentality_vision            18278 non-null  int64  
 56  mentality_penalties         18278 non-null  int64  
 57  mentality_composure         18278 non-null  int64  
 58  defending_marking           18278 non-null  int64  
 59  defending_standing_tackle   18278 non-null  int64  
 60  defending_sliding_tackle    18278 non-null  int64  
 61  ls                          16242 non-null  object 
 62  st                          16242 non-null  object 
 63  rs                          16242 non-null  object 
 64  lw                          16242 non-null  object 
 65  lf                          16242 non-null  object 
 66  cf                          16242 non-null  object 
 67  rf                          16242 non-null  object 
 68  rw                          16242 non-null  object 
 69  lam                         16242 non-null  object 
 70  cam                         16242 non-null  object 
 71  ram                         16242 non-null  object 
 72  lm                          16242 non-null  object 
 73  lcm                         16242 non-null  object 
 74  cm                          16242 non-null  object 
 75  rcm                         16242 non-null  object 
 76  rm                          16242 non-null  object 
 77  lwb                         16242 non-null  object 
 78  ldm                         16242 non-null  object 
 79  cdm                         16242 non-null  object 
 80  rdm                         16242 non-null  object 
 81  rwb                         16242 non-null  object 
 82  lb                          16242 non-null  object 
 83  lcb                         16242 non-null  object 
 84  cb                          16242 non-null  object 
 85  rcb                         16242 non-null  object 
 86  rb                          16242 non-null  object 
dtypes: float64(13), int64(39), object(35)
memory usage: 12.1+ MB

C. Data Quality

Dataset Summary before cleanup

  1. Data types: float64(13), int64(39), object(35)

    • For the 13 features of type **float64**: all contain missing values.
    • For the 35 features of type **object**: 28 contain missing values (the 26 position features plus team_position and nation_position).
    • For the 39 features of type **int64**: no missing values.
  2. Column names contain no spaces, so there is no need to change them.

  3. We will fix all of these, and handle the data type conversion (continuous, discrete, ordinal, nominal), in a later section after doing some further analysis.
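The per-dtype summary above can be cross-checked programmatically. The following is a minimal sketch on a small made-up frame (toy stands in for df_players; its values are hypothetical), counting for each dtype how many features exist and how many of them contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_players (values are made up)
toy = pd.DataFrame({
    "overall": [94, 93, 91],                 # integer, complete
    "pace": [87.0, 90.0, np.nan],            # float, missing for the goalkeeper
    "gk_diving": [np.nan, np.nan, 87.0],     # float, missing for outfield players
    "st": ["89+2", "91+3", None],            # text, missing for the goalkeeper
    "short_name": ["A", "B", "C"],           # text, complete
})

# One row per feature: its dtype and its number of missing values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
})

# Per dtype: number of features, and how many have at least one missing value
by_dtype = summary.groupby("dtype")["n_missing"].agg(
    features="size",
    with_missing=lambda s: int((s > 0).sum()),
)
print(by_dtype)
```

Running the same two steps on df_players should reproduce the counts listed in item 1, assuming the frame was loaded as in cell In [3].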

Conducting some descriptive data analysis and statistics

Players positional attributes

  1. The attribute player_positions lists the various positions a player plays in at the club and can hold multiple values. During data cleanup we will extract the very first position in the comma-separated list as a new feature called preferred_position.
  2. nation_position is the position that a player plays in for his national team. It holds a single position value and is populated for only 1126 of the 18278 players (17152 missing values), so we will not use this attribute for analysis.
  3. team_position is the current position in which the player is being used by the team or club.
In [5]:
df_players['player_positions'].value_counts()
df_players['nation_position'].value_counts()
df_players['team_position'].value_counts()
Out[5]:
CB              2322
GK              2036
ST              1809
CM               786
CDM, CM          731
                ... 
RM, CB, RB         1
LM, CF, CAM        1
CM, CB, RB         1
CM, LM, RB         1
CDM, CAM, RM       1
Name: player_positions, Length: 643, dtype: int64
Out[5]:
SUB    587
GK      49
RCB     49
LCB     49
RB      46
LB      46
ST      43
RCM     36
LCM     36
RM      29
LM      29
CAM     22
CDM     21
RW      19
LW      19
RDM     12
LDM     12
LS       6
RS       6
CB       3
CM       2
CF       1
RWB      1
RF       1
LWB      1
LF       1
Name: nation_position, dtype: int64
Out[5]:
SUB    7820
RES    2958
GK      662
RCB     660
LCB     660
RB      560
LB      560
ST      458
RCM     411
LCM     411
RM      399
LM      398
CAM     311
RDM     244
LDM     242
LS      195
RS      195
CDM     181
LW      162
RW      161
CB      100
CM       76
RWB      58
LWB      58
LAM      23
RAM      23
RF       19
LF       19
CF       14
Name: team_position, dtype: int64

Wages of players binned into ten slots

Note: There are 240 players with a wage_eur of 0 and 250 with a value_eur of 0. We will fill these with median values for our predictions.

In [6]:
df_players['wage_eur'].value_counts(bins=10)
df_players['wage_eur'].loc[df_players.wage_eur <=0]
df_players['value_eur'].value_counts(bins=10)
df_players['value_eur'].loc[df_players.value_eur <=0]
Out[6]:
(-565.001, 56500.0]     17820
(56500.0, 113000.0]       310
(113000.0, 169500.0]       83
(169500.0, 226000.0]       38
(226000.0, 282500.0]       13
(282500.0, 339000.0]        7
(339000.0, 395500.0]        4
(508500.0, 565000.0]        1
(452000.0, 508500.0]        1
(395500.0, 452000.0]        1
Name: wage_eur, dtype: int64
Out[6]:
327      0
328      0
407      0
408      0
409      0
        ..
16353    0
16354    0
16356    0
16600    0
16714    0
Name: wage_eur, Length: 240, dtype: int64
Out[6]:
(-105500.001, 10550000.0]    17418
(10550000.0, 21100000.0]       571
(21100000.0, 31650000.0]       156
(31650000.0, 42200000.0]        63
(42200000.0, 52750000.0]        35
(52750000.0, 63300000.0]        16
(63300000.0, 73850000.0]         9
(73850000.0, 84400000.0]         5
(84400000.0, 94950000.0]         3
(94950000.0, 105500000.0]        2
Name: value_eur, dtype: int64
Out[6]:
327      0
328      0
407      0
408      0
409      0
        ..
16354    0
16356    0
16600    0
16714    0
18233    0
Name: value_eur, Length: 250, dtype: int64
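The zero counts quoted in the note can also be obtained without listing the rows. This is a minimal sketch on made-up values (toy is a hypothetical stand-in for df_players); it counts zeros per column and the rows where both fields are zero.

```python
import pandas as pd

# Made-up wages and values; zeros mark players without a recorded amount
toy = pd.DataFrame({
    "wage_eur": [565000, 0, 3000, 0, 1000],
    "value_eur": [95500000, 0, 700000, 0, 0],
})

# Number of zero entries in each column
zero_counts = (toy[["wage_eur", "value_eur"]] == 0).sum()

# Rows where both wage and value are zero
both_zero = int(((toy["wage_eur"] == 0) & (toy["value_eur"] == 0)).sum())

print(zero_counts)
print(both_zero)
```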

Preferred foot

In [7]:
df_players['preferred_foot'].value_counts()
Out[7]:
Right    13960
Left      4318
Name: preferred_foot, dtype: int64

Body type

Note: It looks like seven players were not categorized properly. During the data cleanup stage we will set them to Normal in case we decide to use this feature for prediction.

In [8]:
df_players['body_type'].value_counts()
Out[8]:
Normal                 10750
Lean                    6505
Stocky                  1016
Akinfenwa                  1
Courtois                   1
PLAYER_BODY_TYPE_25        1
Messi                      1
Shaqiri                    1
Neymar                     1
C. Ronaldo                 1
Name: body_type, dtype: int64

Number of players in each club

In [9]:
df_players['club'].value_counts()
Out[9]:
Real Valladolid CF            33
Athletic Club de Bilbao       33
Lecce                         33
Manchester City               33
FC Nantes                     33
                              ..
Mexico                         1
US Orléans Loiret Football     1
Colombia                       1
Turkey                         1
Poland                         1
Name: club, Length: 698, dtype: int64

1. player_positions attribute

As mentioned above, player_positions lists the positions that a player can play as comma-separated values, and the very first item in that list is the player's preferred field position. We will now extract the first value into preferred_position.

In [10]:
df_players['preferred_position'] = df_players.player_positions.str.split(',').apply(lambda x: x[0])

team_position

We see that team_position has 240 empty values. Since we do not intend to use team_position or nation_position in our analysis, we will drop both.

In [11]:
df_players = df_players.drop(columns=['nation_position','team_position'])
In [12]:
df_players.shape
Out[12]:
(18278, 86)

Get total goal keepers vs regular players.

It is critical to properly distinguish goalkeepers from the other players, as many attributes are specific to either goalkeepers or regular players. We will rely on this distinction heavily in our data cleanup and analysis later.

In [13]:
#goal keeper count
df_players[df_players['preferred_position'] =='GK'].shape

#regular player count
df_players[df_players['preferred_position'] !='GK'].shape
Out[13]:
(2036, 86)
Out[13]:
(16242, 86)

Categorizing to four player positions

  1. Defending (DEF)
  2. Forward (FWD)
  3. Midfield (MID)
  4. Goal Keeper (GK)

We will use this later to predict player positions

In [14]:
df_players.preferred_position.value_counts()
df_players['preferred_position_cat'] = df_players.preferred_position.map({
    
        'CB': 'DEF',   
        'ST': 'FWD',   
        'CM': 'MID',  
        'GK': 'GK',     
       'CDM': 'DEF',
        'RB': 'DEF',
        'LB': 'DEF',     
       'CAM': 'FWD',    
        'RM': 'MID',  
        'LM': 'MID',
        'LW': 'FWD',      
        'RW': 'FWD',      
        'CF': 'FWD',    
       'LWB': 'DEF',    
        'RWB': 'DEF'
    })
Out[14]:
CB     3162
ST     2582
CM     2193
GK     2036
CDM    1424
RB     1314
LB     1303
CAM    1146
RM     1050
LM     1049
LW      378
RW      369
CF      113
LWB      90
RWB      69
Name: preferred_position, dtype: int64
In [15]:
df_players.preferred_position_cat.value_counts()
Out[15]:
DEF    7362
FWD    4588
MID    4292
GK     2036
Name: preferred_position_cat, dtype: int64
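Series.map returns NaN for any key that is absent from the dictionary, so a position code missing from the mapping would silently drop out of the category counts. Here the four counts in Out[15] sum to 18278, so every player was mapped. A minimal sketch of an explicit check follows; the shortened mapping and the "XX" code are made up for illustration.

```python
import pandas as pd

# Shortened, illustrative mapping (the full one is in cell In [14])
position_to_cat = {"CB": "DEF", "ST": "FWD", "CM": "MID", "GK": "GK"}

# "XX" is a made-up code that is deliberately absent from the mapping
preferred = pd.Series(["CB", "ST", "GK", "CM", "XX"])
cat = preferred.map(position_to_cat)

# Position codes that did not receive a category
unmapped = preferred[cat.isna()].unique().tolist()
print(unmapped)
```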

2. Wages

As seen in the analysis in the section above, there are 240 players with a wage_eur of 0 and 250 with a value_eur of 0. We will fill these with median values for our predictions.

In [16]:
# 1. Wages: the fields wage_eur and value_eur have 240 and 250 values respectively set to 0
# (analyzed earlier); we set these values to the respective median values
df_players['wage_eur'].median()
df_players['wage_eur'] = df_players['wage_eur'].replace(0, df_players['wage_eur'].median())
Out[16]:
3000.0
In [17]:
df_players['value_eur'].median()
df_players['value_eur'] = df_players['value_eur'].replace(0, df_players['value_eur'].median())
Out[17]:
700000.0
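The medians used above are computed over all rows, including the zero entries being replaced. With only 240 and 250 zeros out of 18278 rows the effect is probably small, but an alternative is to take the median of the non-zero values only. A minimal sketch on made-up wages (not the authors' method) shows the difference:

```python
import pandas as pd

# Made-up wages; the zeros are the entries to be filled
wages = pd.Series([0, 0, 1000, 3000, 8000], name="wage_eur")

median_with_zeros = wages.median()            # zeros pull the median down
median_nonzero = wages[wages > 0].median()    # median of recorded wages only

# Replace the zeros with the non-zero median, leaving other values unchanged
filled = wages.mask(wages == 0, median_nonzero)

print(median_with_zeros, median_nonzero)
print(filled.tolist())
```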

3. Body type

We identified seven players that were not categorized properly, so we will set them to Normal in case we decide to use this feature for prediction.

In [18]:
df_players['body_type'] = df_players['body_type'].replace({'Shaqiri': 'Normal', 'Akinfenwa': 'Normal'
                                , 'C. Ronaldo': 'Normal', 'PLAYER_BODY_TYPE_25': 'Normal'
                                , 'Neymar': 'Normal', 'Courtois': 'Normal','Messi':'Normal'})
df_players['body_type'].value_counts()
Out[18]:
Normal    10757
Lean       6505
Stocky     1016
Name: body_type, dtype: int64

4. Refactor position features [Object to discrete]

The position features include a year-over-year improvement or decline. For example, a player in the st (Striker) position with a value of 89+2 has a rating of 89 with an increase of 2 from last year (a - would indicate a decrease). Since we will not be doing any year-over-year analysis, we will remove the suffix for these columns = ['ls','st','rs','lw','lf','cf','rf', 'rw','lam','cam','ram', 'lm','lcm','cm','rcm','rm','lwb','ldm', 'cdm','rdm', 'rwb','lb','lcb', 'cb','rcb','rb']

In [19]:
# 4. Refactor the position features described above to remove the +/- increment/decrement and preserve
# the current year's score (the first two characters, e.g. '89+2' -> '89')
df_players1 = df_players[df_position_attr].apply(lambda x: x.str.slice(start=0,stop=2), 
                          axis=1, result_type='broadcast')
df_players[df_position_attr] = df_players1
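Slicing the first two characters leaves the scores as strings and assumes every rating has exactly two digits. As a hedged alternative sketch (not the authors' method), a regular expression can pull out the leading digits and pd.to_numeric can convert them to numbers in one step. The frame toy and the list position_cols below are made-up stand-ins for df_players and df_position_attr.

```python
import pandas as pd

# Made-up ratings in the same 'score+change' format; None marks a goalkeeper row
toy = pd.DataFrame({
    "st": ["89+2", "68-1", None],
    "cb": ["52+3", "70+0", None],
})
position_cols = ["st", "cb"]

# Keep only the leading digits of each rating and convert them to numbers;
# missing entries stay missing
base = toy[position_cols].apply(
    lambda col: pd.to_numeric(col.str.extract(r"^(\d+)", expand=False), errors="coerce")
)
print(base)
```

This version works column by column, so it avoids the row-wise apply and also yields numeric columns, which matches the "Object to discrete" goal in the heading.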

5. Dealing with missing values

The player positional features mentioned above and the skills features ['pace','shooting', 'passing','dribbling', 'defending','physic'] each have 2036 missing values. All of the affected players are categorized as playing in the GK position (player_positions="GK", Goal Keeper). We will set these values to zero.

6. Goal Keeper specific attributes

The following features ['gk_diving','gk_handling','gk_kicking','gk_reflexes', 'gk_speed','gk_positioning'] are specific to goalkeepers. Each has 16242 missing values, which correspond to the non-goalkeeper (regular positional) players. We will set these values to zero. There are also duplicate columns representing the same attributes, which we will ignore: ['goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 'goalkeeping_reflexes']

In [20]:
df_players.info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18278 entries, 0 to 18277
Data columns (total 87 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   sofifa_id                   18278 non-null  int64  
 1   short_name                  18278 non-null  object 
 2   age                         18278 non-null  int64  
 3   dob                         18278 non-null  object 
 4   height_cm                   18278 non-null  int64  
 5   weight_kg                   18278 non-null  int64  
 6   nationality                 18278 non-null  object 
 7   club                        18278 non-null  object 
 8   overall                     18278 non-null  int64  
 9   potential                   18278 non-null  int64  
 10  value_eur                   18278 non-null  int64  
 11  wage_eur                    18278 non-null  int64  
 12  player_positions            18278 non-null  object 
 13  preferred_foot              18278 non-null  object 
 14  international_reputation    18278 non-null  int64  
 15  weak_foot                   18278 non-null  int64  
 16  body_type                   18278 non-null  object 
 17  release_clause_eur          16980 non-null  float64
 18  pace                        16242 non-null  float64
 19  shooting                    16242 non-null  float64
 20  passing                     16242 non-null  float64
 21  dribbling                   16242 non-null  float64
 22  defending                   16242 non-null  float64
 23  physic                      16242 non-null  float64
 24  gk_diving                   2036 non-null   float64
 25  gk_handling                 2036 non-null   float64
 26  gk_kicking                  2036 non-null   float64
 27  gk_reflexes                 2036 non-null   float64
 28  gk_speed                    2036 non-null   float64
 29  gk_positioning              2036 non-null   float64
 30  attacking_crossing          18278 non-null  int64  
 31  attacking_finishing         18278 non-null  int64  
 32  attacking_heading_accuracy  18278 non-null  int64  
 33  attacking_short_passing     18278 non-null  int64  
 34  attacking_volleys           18278 non-null  int64  
 35  skill_dribbling             18278 non-null  int64  
 36  skill_curve                 18278 non-null  int64  
 37  skill_fk_accuracy           18278 non-null  int64  
 38  skill_long_passing          18278 non-null  int64  
 39  skill_ball_control          18278 non-null  int64  
 40  movement_acceleration       18278 non-null  int64  
 41  movement_sprint_speed       18278 non-null  int64  
 42  movement_agility            18278 non-null  int64  
 43  movement_reactions          18278 non-null  int64  
 44  movement_balance            18278 non-null  int64  
 45  power_shot_power            18278 non-null  int64  
 46  power_jumping               18278 non-null  int64  
 47  power_stamina               18278 non-null  int64  
 48  power_strength              18278 non-null  int64  
 49  power_long_shots            18278 non-null  int64  
 50  mentality_aggression        18278 non-null  int64  
 51  mentality_interceptions     18278 non-null  int64  
 52  mentality_positioning       18278 non-null  int64  
 53  mentality_vision            18278 non-null  int64  
 54  mentality_penalties         18278 non-null  int64  
 55  mentality_composure         18278 non-null  int64  
 56  defending_marking           18278 non-null  int64  
 57  defending_standing_tackle   18278 non-null  int64  
 58  defending_sliding_tackle    18278 non-null  int64  
 59  ls                          16242 non-null  object 
 60  st                          16242 non-null  object 
 61  rs                          16242 non-null  object 
 62  lw                          16242 non-null  object 
 63  lf                          16242 non-null  object 
 64  cf                          16242 non-null  object 
 65  rf                          16242 non-null  object 
 66  rw                          16242 non-null  object 
 67  lam                         16242 non-null  object 
 68  cam                         16242 non-null  object 
 69  ram                         16242 non-null  object 
 70  lm                          16242 non-null  object 
 71  lcm                         16242 non-null  object 
 72  cm                          16242 non-null  object 
 73  rcm                         16242 non-null  object 
 74  rm                          16242 non-null  object 
 75  lwb                         16242 non-null  object 
 76  ldm                         16242 non-null  object 
 77  cdm                         16242 non-null  object 
 78  rdm                         16242 non-null  object 
 79  rwb                         16242 non-null  object 
 80  lb                          16242 non-null  object 
 81  lcb                         16242 non-null  object 
 82  cb                          16242 non-null  object 
 83  rcb                         16242 non-null  object 
 84  rb                          16242 non-null  object 
 85  preferred_position          18278 non-null  object 
 86  preferred_position_cat      18278 non-null  object 
dtypes: float64(13), int64(39), object(35)
memory usage: 12.1+ MB
In [21]:
# 5, 6. we will set all the remaining missing values to 0 as explained above in bullets 5 & 6
# (this also zero-fills the 1298 missing release_clause_eur values)
df_players.fillna(0, inplace=True)
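
Before zero-filling, it can be worth confirming that the missing values really follow the goalkeeper / non-goalkeeper split described above. The following is a minimal sketch on a toy frame (the rows are invented; only the column names come from the dataset).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'player_positions': ['GK', 'ST, CF', 'CB', 'GK'],
    'pace': [np.nan, 85.0, 60.0, np.nan],
    'gk_diving': [80.0, np.nan, np.nan, 70.0],
})

# Outfield skills should be missing only for goalkeepers ...
pace_missing_only_for_gk = toy.loc[toy['pace'].isna(), 'player_positions'].eq('GK').all()
# ... and goalkeeper skills should be missing only for everyone else.
gk_missing_only_for_others = toy.loc[toy['gk_diving'].isna(), 'player_positions'].ne('GK').all()

# Fill with zero only when both checks hold; otherwise leave the frame as it is.
checks_pass = pace_missing_only_for_gk and gk_missing_only_for_others
toy_filled = toy.fillna(0) if checks_pass else toy
print(toy_filled)
```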

Data type conversion

We will first convert our continuous features.

In [22]:
# changing the continuous values to be float64
continuous_features = ['height_cm','weight_kg','value_eur', 'wage_eur']
df_players[continuous_features] = df_players[continuous_features].astype(np.float64)

sofifa_id is converted to object as it is a nominal identifier.

In [23]:
df_players['sofifa_id'] = df_players['sofifa_id'].astype(object)

We will convert object features that are discrete to integer

In [24]:
#change the position features from object to int64 as they were manipulated to remove +/- based on step 1
df_players[df_position_attr] = df_players[df_position_attr].astype(np.int64)
df_players.info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18278 entries, 0 to 18277
Data columns (total 87 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   sofifa_id                   18278 non-null  object 
 1   short_name                  18278 non-null  object 
 2   age                         18278 non-null  int64  
 3   dob                         18278 non-null  object 
 4   height_cm                   18278 non-null  float64
 5   weight_kg                   18278 non-null  float64
 6   nationality                 18278 non-null  object 
 7   club                        18278 non-null  object 
 8   overall                     18278 non-null  int64  
 9   potential                   18278 non-null  int64  
 10  value_eur                   18278 non-null  float64
 11  wage_eur                    18278 non-null  float64
 12  player_positions            18278 non-null  object 
 13  preferred_foot              18278 non-null  object 
 14  international_reputation    18278 non-null  int64  
 15  weak_foot                   18278 non-null  int64  
 16  body_type                   18278 non-null  object 
 17  release_clause_eur          18278 non-null  float64
 18  pace                        18278 non-null  float64
 19  shooting                    18278 non-null  float64
 20  passing                     18278 non-null  float64
 21  dribbling                   18278 non-null  float64
 22  defending                   18278 non-null  float64
 23  physic                      18278 non-null  float64
 24  gk_diving                   18278 non-null  float64
 25  gk_handling                 18278 non-null  float64
 26  gk_kicking                  18278 non-null  float64
 27  gk_reflexes                 18278 non-null  float64
 28  gk_speed                    18278 non-null  float64
 29  gk_positioning              18278 non-null  float64
 30  attacking_crossing          18278 non-null  int64  
 31  attacking_finishing         18278 non-null  int64  
 32  attacking_heading_accuracy  18278 non-null  int64  
 33  attacking_short_passing     18278 non-null  int64  
 34  attacking_volleys           18278 non-null  int64  
 35  skill_dribbling             18278 non-null  int64  
 36  skill_curve                 18278 non-null  int64  
 37  skill_fk_accuracy           18278 non-null  int64  
 38  skill_long_passing          18278 non-null  int64  
 39  skill_ball_control          18278 non-null  int64  
 40  movement_acceleration       18278 non-null  int64  
 41  movement_sprint_speed       18278 non-null  int64  
 42  movement_agility            18278 non-null  int64  
 43  movement_reactions          18278 non-null  int64  
 44  movement_balance            18278 non-null  int64  
 45  power_shot_power            18278 non-null  int64  
 46  power_jumping               18278 non-null  int64  
 47  power_stamina               18278 non-null  int64  
 48  power_strength              18278 non-null  int64  
 49  power_long_shots            18278 non-null  int64  
 50  mentality_aggression        18278 non-null  int64  
 51  mentality_interceptions     18278 non-null  int64  
 52  mentality_positioning       18278 non-null  int64  
 53  mentality_vision            18278 non-null  int64  
 54  mentality_penalties         18278 non-null  int64  
 55  mentality_composure         18278 non-null  int64  
 56  defending_marking           18278 non-null  int64  
 57  defending_standing_tackle   18278 non-null  int64  
 58  defending_sliding_tackle    18278 non-null  int64  
 59  ls                          18278 non-null  int64  
 60  st                          18278 non-null  int64  
 61  rs                          18278 non-null  int64  
 62  lw                          18278 non-null  int64  
 63  lf                          18278 non-null  int64  
 64  cf                          18278 non-null  int64  
 65  rf                          18278 non-null  int64  
 66  rw                          18278 non-null  int64  
 67  lam                         18278 non-null  int64  
 68  cam                         18278 non-null  int64  
 69  ram                         18278 non-null  int64  
 70  lm                          18278 non-null  int64  
 71  lcm                         18278 non-null  int64  
 72  cm                          18278 non-null  int64  
 73  rcm                         18278 non-null  int64  
 74  rm                          18278 non-null  int64  
 75  lwb                         18278 non-null  int64  
 76  ldm                         18278 non-null  int64  
 77  cdm                         18278 non-null  int64  
 78  rdm                         18278 non-null  int64  
 79  rwb                         18278 non-null  int64  
 80  lb                          18278 non-null  int64  
 81  lcb                         18278 non-null  int64  
 82  cb                          18278 non-null  int64  
 83  rcb                         18278 non-null  int64  
 84  rb                          18278 non-null  int64  
 85  preferred_position          18278 non-null  object 
 86  preferred_position_cat      18278 non-null  object 
dtypes: float64(17), int64(60), object(10)
memory usage: 12.1+ MB

Dataset Summary after cleanup

Data types: float64(17), int64(60), object(10)
No missing values

Our cleaned dataset copy with all players

In [25]:
# Our cleaned dataset **copy** is df_players_cleaned
df_players_cleaned = df_players.copy()
df_players_cleaned.shape
Out[25]:
(18278, 87)

We will also have a subset for goal keepers as df_players_gk. Boolean indexing returns a new DataFrame rather than a view.

In [26]:
# We will now also have subsets for goal keepers (df_players_gk) and regular players (df_players_regular)
df_players_gk = df_players_cleaned[df_players_cleaned['player_positions'] =='GK']
df_players_gk.shape
Out[26]:
(2036, 87)

We will also have a subset for non-goal keepers as df_players_regular

In [27]:
df_players_regular = df_players_cleaned[df_players_cleaned['player_positions'] !='GK']
df_players_regular.shape
Out[27]:
(16242, 87)
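
The two shapes above add up to the 18278 rows of the cleaned dataset. As a small illustration on toy data (names and rows are invented), the sketch below checks that a boolean mask and its negation split a frame into two disjoint subsets, and that editing a subset does not touch the original frame.

```python
import pandas as pd

toy = pd.DataFrame({
    'short_name': ['A', 'B', 'C', 'D'],
    'player_positions': ['GK', 'ST', 'CB, RB', 'GK'],
})
is_gk = toy['player_positions'] == 'GK'
toy_gk = toy[is_gk].copy()
toy_regular = toy[~is_gk].copy()

# Every row lands in exactly one subset.
print(len(toy_gk), len(toy_regular), len(toy))

# Editing the subset leaves the original frame untouched.
toy_gk.loc[:, 'short_name'] = 'edited'
print(toy['short_name'].tolist())
```

The explicit .copy() is our addition; it makes the independence from the original frame explicit.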

Simple Statistics

We selected some attributes to explore.

In [28]:
# we are grouping different types of attributes into different feature groups
player_continuous_features = ['age','height_cm','weight_kg','value_eur', 'wage_eur']

Age

The age range is between 16 and 42 with an average of 25. It is interesting that a professional soccer player can be as young as 16.

Wage

The average wage is about 9,496 euros with a very high standard deviation of about 21,337 euros. It makes sense that wages vary and that superstars earn a lot more, but it is surprising that many professional players do not get high wages: at least a quarter of them earn the minimum of 1,000 euros, and the median is 3,000 euros. Also, the maximum weight is 110 kg. We will look into this outlier.

In [29]:
# we are adding more percentile to the describe
df_players_cleaned[player_continuous_features].describe([.05,.1,.25,.5,.75,.9,.95])
Out[29]:
                age     height_cm     weight_kg     value_eur       wage_eur
count  18278.000000  18278.000000  18278.000000  1.827800e+04   18278.000000
mean      25.283291    181.362184     75.276343  2.493612e+06    9496.334391
std        4.656964      6.756961      7.047744  5.581813e+06   21336.992174
min       16.000000    156.000000     50.000000  1.000000e+04    1000.000000
5%        19.000000    170.000000     64.000000  1.100000e+05    1000.000000
10%       19.000000    173.000000     66.000000  1.500000e+05    1000.000000
25%       22.000000    177.000000     70.000000  3.500000e+05    1000.000000
50%       25.000000    181.000000     75.000000  7.000000e+05    3000.000000
75%       29.000000    186.000000     80.000000  2.100000e+06    8000.000000
90%       32.000000    190.000000     85.000000  6.500000e+06   23000.000000
95%       33.000000    192.000000     87.000000  1.050000e+07   38000.000000
max       42.000000    205.000000    110.000000  1.055000e+08  565000.000000
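
One common way to follow up on a suspected outlier such as the 110 kg maximum is the interquartile-range (Tukey) rule. The sketch below is one possible check, not the method used later in the notebook. It first applies the rule to the weight_kg quartiles reported above (70 and 80), then to a toy Series whose values are illustrative only.

```python
import pandas as pd

def iqr_fences(q1: float, q3: float, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of weight_kg as reported by describe() above.
lower, upper = iqr_fences(q1=70.0, q3=80.0)
print(lower, upper)

# The same rule applied to a toy Series (values are illustrative only).
toy_weights = pd.Series([64, 70, 75, 80, 87, 110])
q1, q3 = toy_weights.quantile([0.25, 0.75])
toy_lower, toy_upper = iqr_fences(q1, q3)
outliers = toy_weights[(toy_weights < toy_lower) | (toy_weights > toy_upper)]
print(outliers.tolist())
```

With the reported quartiles the fences are 55 kg and 95 kg, so both the 110 kg maximum and the 50 kg minimum fall outside them.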

Ability Attributes: all positions

The other attributes we selected are related to abilities. The mentality, attacking and movement attributes apply to all positions. The minimum scores are very low and the standard deviations are fairly high because some positions do not require the skills that other positions do. For instance, a goal keeper does not need much in the way of attacking skills. This summary is important because it shows that position is a major factor in the skill scores, so we can run the analysis separately by position.

In [30]:
df_players_cleaned[df_mental_attr].describe([.05,.1,.25,.5,.75,.9,.95])
df_players_cleaned[df_attacking_attr].describe([.05,.1,.25,.5,.75,.9,.95])
df_players_cleaned[df_movement_attr].describe([.05,.1,.25,.5,.75,.9,.95])
Out[30]:
       mentality_aggression  mentality_interceptions  mentality_positioning  mentality_vision  mentality_penalties  mentality_composure
count          18278.000000             18278.000000           18278.000000      18278.000000         18278.000000         18278.000000
mean              55.742149                46.380239              50.072163         53.609749            48.383357            58.528778
std               17.318157                20.775812              19.594022         13.955626            15.708099            11.880840
min                9.000000                 3.000000               2.000000          9.000000             7.000000            12.000000
5%                24.000000                13.000000              10.000000         29.000000            18.000000            37.000000
10%               29.000000                17.000000              16.000000         33.000000            25.000000            43.000000
25%               44.000000                25.000000              39.000000         44.000000            39.000000            51.000000
50%               58.000000                52.000000              55.000000         55.000000            49.000000            60.000000
75%               69.000000                64.000000              64.000000         64.000000            60.000000            67.000000
90%               76.000000                71.000000              71.000000         71.000000            68.000000            73.000000
95%               80.000000                74.000000              75.000000         74.000000            72.000000            76.000000
max               95.000000                92.000000              95.000000         94.000000            92.000000            96.000000
Out[30]:
       attacking_crossing  attacking_finishing  attacking_heading_accuracy  attacking_short_passing  attacking_volleys
count        18278.000000         18278.000000                18278.000000             18278.000000       18278.000000
mean            49.718405            45.590218                   52.221468                58.748003          42.809388
std             18.325403            19.594609                   17.428429                14.679653          17.701815
min              5.000000             2.000000                    5.000000                 7.000000           3.000000
5%              13.000000            11.000000                   13.000000                25.000000          11.000000
10%             19.000000            16.000000                   19.000000                34.000000          17.000000
25%             38.000000            30.000000                   44.000000                54.000000          30.000000
50%             54.000000            49.000000                   56.000000                62.000000          44.000000
75%             64.000000            62.000000                   64.000000                68.000000          56.000000
90%             70.000000            69.000000                   71.000000                74.000000          66.000000
95%             74.000000            73.000000                   75.000000                77.000000          70.000000
max             93.000000            95.000000                   93.000000                92.000000          90.000000
Out[30]:
       movement_acceleration  movement_sprint_speed  movement_agility  movement_reactions  movement_balance
count           18278.000000           18278.000000      18278.000000        18278.000000      18278.000000
mean               64.299923              64.415746         63.504924           61.752544         63.856439
std                15.042232              14.847763         14.808380            9.135613         14.201559
min                12.000000              11.000000         11.000000           21.000000         12.000000
5%                 33.000000              34.000000         34.000000           47.000000         36.000000
10%                42.000000              42.000000         40.000000           50.000000         43.000000
25%                56.000000              57.000000         55.000000           56.000000         56.000000
50%                67.000000              67.000000         66.000000           62.000000         66.000000
75%                75.000000              75.000000         74.000000           68.000000         74.000000
90%                81.000000              81.000000         80.000000           73.000000         80.000000
95%                85.000000              85.000000         84.000000           76.000000         84.000000
max                97.000000              96.000000         96.000000           96.000000         97.000000
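
To illustrate why the pooled standard deviations are high, the sketch below summarizes one attribute per position group on a toy frame. The scores are invented; the column names preferred_position_cat and attacking_finishing are taken from the dataset, and the category labels follow the four classes used later in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'preferred_position_cat': ['GK', 'GK', 'FWD', 'FWD', 'DEF', 'DEF'],
    'attacking_finishing': [10, 14, 80, 84, 30, 40],
})

# Per-group summary with a reduced set of percentiles.
by_position = (
    toy.groupby('preferred_position_cat')['attacking_finishing']
       .describe(percentiles=[.05, .5, .95])
)
print(by_position[['count', 'mean', 'std', 'min', 'max']])
print("pooled std:", toy['attacking_finishing'].std())
```

In this toy example the spread within each group is much smaller than the pooled spread, which is the pattern we expect when position drives the score.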

Ability Attributes: regular positions only

We observed that the technical skill scores of the regular players span a wide range. The averages lie between about 52 and 68, so we anticipate some outliers with lower scores. For pace, passing, dribbling, and physic, the lowest 5% of the players have scores between 39 and 47 and the top 5% have scores between 73 and 85. For shooting and defending, the standard deviation is higher because these skills are more position specific.

In [31]:
df_players_regular[df_technical_attr].describe([.05,.1,.25,.5,.75,.9,.95])
Out[31]:
               pace      shooting       passing     dribbling     defending        physic
count  16242.000000  16242.000000  16242.000000  16242.000000  16242.000000  16242.000000
mean      67.700899     52.298301     57.233777     62.531585     51.553503     64.876678
std       11.297656     14.029418     10.407844     10.284950     16.419528      9.760162
min       24.000000     15.000000     24.000000     23.000000     15.000000     27.000000
5%        47.000000     27.000000     39.000000     42.000000     24.000000     47.000000
10%       53.000000     31.000000     43.000000     49.000000     27.000000     51.000000
25%       61.000000     42.000000     50.000000     57.000000     36.000000     59.000000
50%       69.000000     54.000000     58.000000     64.000000     56.000000     66.000000
75%       75.000000     63.000000     64.000000     69.000000     65.000000     72.000000
90%       81.000000     69.000000     70.000000     74.000000     70.000000     77.000000
95%       85.000000     73.000000     73.000000     77.000000     74.000000     79.000000
max       96.000000     93.000000     92.000000     96.000000     90.000000     90.000000

Ability Attributes: goal keeper position only

Except for speed, the average scores are around 62 to 66. The average speed score (about 38) is much lower than the other goalkeeper skills because a goal keeper does not need to run much.

In [32]:
df_players_gk[df_gk_attr].describe([.05,.1,.25,.5,.75,.9,.95])
Out[32]:
         gk_diving  gk_handling   gk_kicking  gk_reflexes     gk_speed  gk_positioning
count  2036.000000  2036.000000  2036.000000  2036.000000  2036.000000     2036.000000
mean     65.422397    63.146365    61.832515    66.390472    37.798625       63.374754
std       7.736278     7.244023     7.510709     8.154062    10.634038        8.447876
min      44.000000    42.000000    35.000000    45.000000    12.000000       41.000000
5%       53.000000    52.000000    50.000000    54.000000    21.000000       49.000000
10%      56.000000    54.000000    53.000000    56.000000    23.000000       52.000000
25%      60.000000    58.000000    57.000000    60.750000    29.000000       58.000000
50%      65.000000    63.000000    61.000000    66.000000    39.000000       64.000000
75%      70.000000    68.000000    66.000000    72.000000    46.000000       69.000000
90%      76.000000    72.000000    72.000000    77.000000    51.000000       74.000000
95%      79.000000    76.000000    75.000000    80.250000    55.000000       77.000000
max      90.000000    92.000000    93.000000    92.000000    65.000000       91.000000

Visualize Attributes

Wages by position

In the box plots below we see that the median wage for a center forward (CF) is the highest, and the value distribution for CF is higher too. Left-footed players are paid more on average in some positions, for example "RW", "RM" and "LWB", whereas in "CF" and "RB" it is the opposite. We are not sure of the reason.

In [33]:
#Plotting wages distribution on log scale by position
plt.figure(figsize=(20,5))
ax = sns.boxplot(data=df_players, y='wage_eur', x='preferred_position', hue='preferred_foot');
ax.set_yscale('log');
ax.set_title('Wages grouped by preferred position & preferred foot', fontsize=20);
ax.set_xlabel('Preferred Position', fontsize=15);
ax.set_ylabel('Wage (€)', fontsize=15);

# plotting values distribution with sns
plt.figure(figsize=(20,5));
ax1 = sns.boxplot(data=df_players, y='value_eur', x='preferred_position', hue='preferred_foot');
ax1.set_yscale('log');
ax1.set_title('Values grouped by preferred position & preferred foot', fontsize=20);
ax1.set_xlabel('Preferred Position', fontsize=15);
ax1.set_ylabel('Value (€)', fontsize=15);

#df_players['wage_eur'].describe()
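
The medians read off the box plots can also be tabulated directly. The following is a minimal sketch on invented rows (only the column names come from the dataset) that computes the median wage per position and preferred foot.

```python
import pandas as pd

toy = pd.DataFrame({
    'preferred_position': ['CF', 'CF', 'CF', 'RB', 'RB', 'RB'],
    'preferred_foot': ['Left', 'Right', 'Right', 'Left', 'Left', 'Right'],
    'wage_eur': [4000.0, 9000.0, 11000.0, 3000.0, 5000.0, 2000.0],
})

# One row per position, one column per foot.
median_wage = (
    toy.groupby(['preferred_position', 'preferred_foot'])['wage_eur']
       .median()
       .unstack('preferred_foot')
)
print(median_wage)
```

Adding a count per cell alongside the median would show how many players each comparison rests on, which matters for sparsely populated positions.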

Statistics on player height in various positions

Goal keepers and center backs are on average taller than other players.

In [34]:
plt.figure(figsize=(20,5));
sns.set();
ax_height = sns.violinplot(y=df_players['height_cm'], x=df_players['preferred_position'], dodge=True);

ax_height.set_title('Height vs Position', fontsize=20);
ax_height.set_xlabel('Preferred Position', fontsize=15);
ax_height.set_ylabel('Height (cm)', fontsize=15);

Statistics on player weight in various positions

Right and left wing players are on average lighter than other players. The reason could be that wing players often need to run back and forth to the goal line.

In [35]:
plt.figure(figsize=(20,5));
sns.set();
ax_weight = sns.violinplot(y=df_players['weight_kg'], x=df_players['preferred_position'], dodge=True);

ax_weight.set_title('Weight vs Position', fontsize=20);
ax_weight.set_xlabel('Preferred Position', fontsize=15);
ax_weight.set_ylabel('Weight (kg)', fontsize=15);

Average rating by age (Potential, Overall and Age)

In the plot below, we can see that the gap between a player's potential and overall rating shrinks as age increases. By around age 28 the two averages merge and become the same. This makes sense, as younger players have more room to improve; as players age, the two ratings converge.
We see a spike in the overall rating at age 41 because there are only a few players at that age, and their few high ratings pull the average up.

In [36]:
df_players_cleaned_p = df_players_cleaned.groupby(['age'])['potential'].mean()
df_players_cleaned_o = df_players_cleaned.groupby(['age'])['overall'].mean()
df_players_cleaned_summary = pd.concat([df_players_cleaned_p, df_players_cleaned_o], axis=1)

ax_summary = df_players_cleaned_summary.plot();
ax_summary.set_ylabel('Rating');
ax_summary.set_xlabel('Age');
ax_summary.set_title('Average Rating by Age');
plt.show();
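
To back up the remark about small groups at the oldest ages, the per-age means can be shown together with the number of players behind them. The sketch below uses a toy frame with invented ratings; the output column names n_players, mean_overall, mean_potential and gap are ours.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [20, 20, 20, 28, 28, 41],
    'overall': [60, 62, 64, 70, 72, 80],
    'potential': [75, 77, 79, 70, 72, 80],
})

# Mean ratings per age, plus the group size behind each mean.
summary = toy.groupby('age').agg(
    n_players=('overall', 'size'),
    mean_overall=('overall', 'mean'),
    mean_potential=('potential', 'mean'),
)
summary['gap'] = summary['mean_potential'] - summary['mean_overall']
print(summary)
```

A mean that rests on a single player, as for age 41 in this toy example, should be read with caution.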

Map visualization of player distribution by country

şş We use plotly (https://plotly.com/python/choropleth-maps/) to plot the number of players by country.

In [37]:
#Map - to show how many players by country
import warnings
warnings.filterwarnings("ignore")

# Take an explicit copy so the assignments below do not act on a slice
pdf7 = df_players_cleaned[['nationality']].copy()
pdf7['Number of players'] = 1

# Map the dataset's nationality labels to country names that can be matched
# to ISO codes; the four UK home nations are grouped under 'United Kingdom'
nationality_map = {
    'Antigua & Barbuda': 'Antigua and Barbuda',
    'Bosnia Herzegovina': 'Bosnia and Herzegovina',
    'Cape Verde': 'Republic of Cabo Verde',
    'Central African Rep.': 'Central African Republic',
    'China PR': 'China',
    'Chinese Taipei': 'Taiwan',
    'DR Congo': 'Congo',
    'Democratic Republic of the Congo': 'Congo',
    'FYR Macedonia': 'Macedonia',
    'Guinea Bissau': 'Guinea-Bissau',
    'Trinidad & Tobago': 'Trinidad and Tobago',
    'São Tomé & Príncipe': 'São Tomé and Príncipe',
    'Ivory Coast': "Côte d'Ivoire",
    'Korea DPR': "Democratic People's Republic of Korea",
    'Korea Republic': 'Republic of Korea',
    'Macau': 'China',
    'Republic of Ireland': 'Ireland',
    'St Kitts Nevis': 'Saint Kitts and Nevis',
    'St Lucia': 'Saint Lucia',
    'England': 'United Kingdom',
    'Northern Ireland': 'United Kingdom',
    'Scotland': 'United Kingdom',
    'Wales': 'United Kingdom',
}
pdf7['nationality'] = pdf7['nationality'].replace(nationality_map)

# Count players per (renamed) country
pdf8 = pdf7.groupby('nationality', as_index=False).sum()
            
list_countries = pdf8['nationality'].unique().tolist()
d_country_code = {}
for country in list_countries:
    try:
        # take the first match returned by the fuzzy search
        country_data = pycountry.countries.search_fuzzy(country)
        d_country_code[country] = country_data[0].alpha_3
    except LookupError:
        print('We could not add ISO 3 code for ->', country)
        d_country_code[country] = ' '

# attach the ISO 3 code to each country row
pdf8['ISO3'] = pdf8['nationality'].map(d_country_code)



fig = px.choropleth(data_frame = pdf8, 
                    locations= "ISO3",
                    color= 'Number of players', 
                    hover_name= "nationality",
                    #color_continuous_scale = 'Plasma',
                    color_continuous_scale= ["white","green","blue"], 
                    title = 'Number of players per country',
                    )


fig.show()

From this map, we can see that the country with the most players is the UK (here we group England, Northern Ireland, Scotland and Wales as the UK). The second is Germany (dark green). The map is interactive: users can hover over each country to see its ISO code and number of players.

In [38]:
pdf8.sort_values(by=['Number of players'], ascending=False).head(10)
Out[38]:
        nationality  Number of players  ISO3
149  United Kingdom               2142   GBR
54          Germany               1216   DEU
132           Spain               1035   ESP
50           France                984   FRA
5         Argentina                886   ARG
18           Brazil                824   BRA
75            Italy                732   ITA
28         Colombia                591   COL
77            Japan                453   JPN
103     Netherlands                416   NLD

Explore Joint Attributes

We will take a look at correlations between various player skill attributes, grouped into different categories, and in particular at how they relate to a player's value in euros on a log scale. We will run the analysis separately for goal keepers and regular players.

In [39]:
#prepare the plot palette
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
In [40]:
# log value
df_reg_copy = df_players_regular.copy()
df_reg_copy['lvalue_eur'] = np.log(df_reg_copy['value_eur'])

Evaluating players' basic overall and technical skills in relation to their value in euros.

  1. Density plot shows left foot players have a slightly higher distribution at the top spectrum for pace, passing and defending.
  2. Right foot players have a slightly higher distribution at the top spectrum for pace shooting.
  3. Left foot players have a slightly higher distribution at the top spectrum for overall skill.
  4. Shooting, passing and dribbling are linearly correlated with each other. We also see a similar correlation between these three attributes with value_eur that indicates those three skills are more in demand.
  5. There is a very strong linear relation between the players overall skill and value_eur.
  6. In the matrix between value_eur and overall, we see a straight line at log(700000.0). This is because we filled the missing (0) values with the median value during our data cleanup.
In [41]:
#analyse Technical skills of regular Non GK
l=df_technical_attr.append(pd.Series(['overall','lvalue_eur','preferred_foot']))
sns.pairplot(df_reg_copy[l], height=2, hue='preferred_foot');
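
The straight line at log(700000.0), about 13.46, comes from rows whose value was imputed. One way to keep such rows from affecting the correlations is to flag them before imputing. The sketch below shows the idea on a toy frame; the value_imputed column is a hypothetical addition, and the toy values are invented apart from the 700000.0 median reported in Out[17].

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'overall': [60, 65, 70, 80],
    'value_eur': [0.0, 400000.0, 0.0, 9000000.0],
})

# Remember which rows were zero before imputing them.
toy['value_imputed'] = toy['value_eur'].eq(0)
fill_value = 700000.0  # median reported in Out[17]
toy['value_eur'] = toy['value_eur'].replace(0, fill_value)
toy['lvalue_eur'] = np.log(toy['value_eur'])

# Imputed rows all share one log value, whatever their overall rating.
print(toy.loc[toy['value_imputed'], 'lvalue_eur'].unique())

# Dropping them leaves only observed values for plots and correlations.
observed = toy[~toy['value_imputed']]
print(observed[['overall', 'lvalue_eur']])
```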

Below is a heat map of the correlations, which confirms the correlation we saw in the previous matrix plot between value_eur, passing, dribbling, and overall.

In [42]:
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_reg_copy[l].corr(), cmap=cmap, annot=True);
f.tight_layout();
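
A heat map shows every pair at once. To read off which attributes relate most strongly to the log value, the relevant column of the correlation matrix can be sorted. The following is a minimal sketch on invented numbers; it selects the numeric columns explicitly so that the categorical preferred_foot column is left out of the correlation.

```python
import pandas as pd

toy = pd.DataFrame({
    'passing': [40, 50, 60, 70, 80],
    'defending': [70, 30, 65, 35, 60],
    'lvalue_eur': [11.0, 12.0, 13.0, 14.5, 16.0],
    'preferred_foot': ['Left', 'Right', 'Right', 'Left', 'Right'],
})

# Keep numeric columns only, then rank attributes by correlation with lvalue_eur.
numeric = toy.select_dtypes(include='number')
corr_with_value = (
    numeric.corr()['lvalue_eur']
           .drop('lvalue_eur')
           .sort_values(ascending=False)
)
print(corr_with_value)
```
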

Analyzing non goal keepers for attacking category skills

  1. Density plot shows left foot players have a slightly higher distribution at the top spectrum for attacking_crossing. It is possible that left footers are better with both feet.
In [43]:
#analyse player attacking
l=df_attacking_attr.append(pd.Series(['lvalue_eur','preferred_foot']))
sns.pairplot(df_reg_copy[l], height=2, hue='preferred_foot');

Below is the heat map to show the correlation between the attacking attributes.

In [44]:
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_reg_copy[l].corr(), cmap=cmap, annot=True);
f.tight_layout();

Analyzing non goal keepers for defending category skills

  1. We do not see much of a correlation between the defending attributes and value_eur.
  2. We see a bimodal distribution for marking, standing tackle and sliding tackle.
  3. Left foot players have a higher distribution at the top spectrum for marking, standing tackle and sliding tackle.
In [45]:
#analyse player defending
l=df_defending_attr.append(pd.Series(['lvalue_eur','preferred_foot']));
sns.pairplot(df_reg_copy[l], height=2,hue='preferred_foot');

Below is a heat map to show the correlation between defending attributes.

In [46]:
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_reg_copy[l].corr(), cmap=cmap, annot=True);
f.tight_layout();

The matrix plot below of the goal keepers' primary skill sets shows the following.

  1. Density plot shows left footed goal keepers have a higher distribution at top end for kicking and positioning skills.
  2. Right footed goal keepers have a higher distribution at top end for reflexes and positioning skills.
  3. There is very strong positive linear correlation between goal keepers' value_eur and diving, handling, positioning and reflexes.
  4. The relation between speed and value_eur is not very prominent, which makes sense as a goal keeper does not need to run around but rather defends the goal.
In [47]:
#analyse GK skills
df_gk_copy = df_players_gk.copy()
df_gk_copy['lvalue_eur'] = np.log(df_gk_copy['value_eur'])
l = pd.concat([df_gk_attr, pd.Series(['lvalue_eur','preferred_foot'])])  # Series.append was removed in pandas 2.0
sns.pairplot(df_gk_copy[l], height=2,hue='preferred_foot');

Below is a heat map to show the correlation between the GK attributes.

In [48]:
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_gk_copy[l].corr(), cmap=cmap, annot=True);
f.tight_layout();

Explore Attributes And Class

We plan to use the attacking, skill, movement, power and defending features to classify players' positions into the following four categories.

  1. DEF (All players in the backfield defending the goal)
  2. MID (Players in the mid field)
  3. FWD (Players in the attacking, forward position)
  4. GK (Goal Keepers)

We have explored the correlation of each attribute in the previous section. The density plots are generated to observe the distribution of each attribute in different positions.

The score distributions depend strongly on the position a player plays. We expect this is because each position focuses on specific skills, e.g. forward players are good at finishing, mid field players are good at passing, and defending players are good at tackling.

In all the density plots, goal keepers generally have the lowest scores because the goal keeper position is specialized in goal keeper skill sets rather than the outfield features shown here.

Defending players have much higher scores in the defending attributes: marking, standing tackle and sliding tackle.

Forward players have higher scores in mentality positioning, mentality penalties and attacking finishing.

Mid field players have higher scores in mentality vision, attacking crossing and attacking short passing.

In some attributes, mid field players have a similar distribution to forward players, for example skill dribbling, skill ball control, power shot power, power jumping and power long shots.

In some attributes, defending, forward and mid field players all have a similar distribution, for example mentality composure, movement acceleration, movement sprint speed, movement reactions and power stamina.

Mid field and defending players generally have higher scores in skill long passing than forward players. This makes sense, since forward players do not usually play long passes.
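
The comparisons above can also be checked numerically with a per-position table of means. The following is a minimal sketch, assuming a frame with the same `preferred_position_cat` column as `df_players_cleaned`. The helper name `position_means` and the toy rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def position_means(df, attributes, group_col="preferred_position_cat"):
    """Return the mean of each attribute per position category."""
    return df.groupby(group_col)[list(attributes)].mean()

# Hypothetical toy rows, only to show the shape of the result
toy = pd.DataFrame({
    "preferred_position_cat": ["DEF", "DEF", "MID", "FWD", "FWD", "GK"],
    "attacking_finishing":    [30, 40, 60, 80, 90, 10],
    "defending_marking":      [80, 70, 60, 30, 20, 10],
})
summary = position_means(toy, ["attacking_finishing", "defending_marking"])
print(summary)
```

On the real data, a large gap between the rows of such a table for one attribute corresponds to well-separated density curves for that attribute.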

In [49]:
#Function to plot per-position density plots for the passed dataframe and attributes
import math

def plot_density(df, attributes):
    n_cols = 3
    n_rows = math.ceil(len(attributes) / n_cols)  # subplot() needs integer grid sizes
    fig = plt.figure(figsize=(12, 3 * n_rows))
    positions = ["DEF", "MID", "FWD", "GK"]
    for index, plot_vars in enumerate(attributes):
        ax = plt.subplot(n_rows, n_cols, index + 1)
        # one density curve per position category
        for pos in positions:
            subset = df.loc[df["preferred_position_cat"] == pos, plot_vars]
            sns.kdeplot(data=subset, label=pos, ax=ax)
        ax.set(xlabel=plot_vars)
        ax.legend()  # newer seaborn versions do not add the legend automatically
    plt.tight_layout()
    plt.show()
In [50]:
l=df_mental_attr
plot_density(df_players_cleaned,l)
In [51]:
l=df_attacking_attr
plot_density(df_players_cleaned,l)
In [52]:
l=df_skill_attr
plot_density(df_players_cleaned,l)
In [53]:
l=df_movement_attr
plot_density(df_players_cleaned,l)
In [54]:
l=df_power_attr
plot_density(df_players_cleaned,l)

We will use all of the above attributes to run our classification model, which is in the Exceptional Work section.

New Features

We plan to use https://fbref.com/en/comps/9/Premier-League-Stats to get more data on plays, wins and goals per club and league for further analyses in future labs.

Clubs comparisons within top three leagues (Non Goal Keepers)

Now we will compare the top-performing and bottom-performing teams in the top three leagues for 2020 to see how much they differ in the main skills of players in regular positions, which we can use to build an evaluation of players' salaries.
We will then use this analysis to come up with a player budget to build a new team of our own and compare it to the average skill set of the top three teams.
şş https://fbref.com/en/comps/20/Bundesliga-Stats The current top three leagues, with their top-ranked and bottom-ranked clubs, are as follows.

  1. English Premier League (1. Liverpool, 20. Norwich City)
  2. Bundesliga (1. Bayern Munich, 18. Paderborn)
  3. La Liga (1. Barcelona, 20. Espanyol)

We will plot a comparison of the top ranked vs bottom ranked within each league.

şş https://python-graph-gallery.com/391-radar-chart-with-several-individuals/ We used this site to get the idea for creating the spider plot.

In [55]:
#Common method to plot a comparative spider graph for two teams passed in.
# df: the data frame with the two team values to compare
# attributes: the corresponding attributes these values belong to, on which the plot will be based
# league: the league name, used as the plot title
def plot_spider(df,attributes, league):
    categories=list(np.array(attributes))
    #get the two club names
    teams = df['club']
    #categories=list(df_players_regular.loc[0,labels].values)
    N = len(categories)

    # What will be the angle of each axis in the plot? (we divide the plot / number of variable)
    #angles=np.linspace(0, 2*np.pi, len(labels), endpoint=False)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]

    # Initialise the spider plot
    fig = plt.figure();
    ax = fig.add_subplot(111, polar=True)

    # If you want the first axis to be on top:
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)

    # Draw one axis per variable and add the labels
    plt.xticks(angles[:-1], categories)

    # ------- PART 2: Add plots
    # we can loop this to make it generic
    # Team 1
    values=df.loc[0,categories].tolist()
    values += values[:1]
    ax.plot(angles, values, linewidth=1, linestyle='solid', label=teams[0])
    ax.fill(angles, values, 'b', alpha=0.1)

    # Team 2
    values=df.loc[1,categories].tolist()
    values += values[:1]
    ax.plot(angles, values, linewidth=1, linestyle='solid', label=teams[1])
    ax.fill(angles, values, 'r', alpha=0.1)
    #ax.set_title("Spider", )
    fig.suptitle(league, fontsize=20)
    plt.subplots_adjust(top=0.85)
    # Add legend
    plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1));
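
The angle handling in `plot_spider` is the part that is easiest to get wrong: each attribute gets an equally spaced angle, and the first point is repeated at the end so that the outline closes. A minimal sketch of just that step follows. The helper name `closed_polygon` and the six scores are hypothetical.

```python
from math import pi

def closed_polygon(values):
    """Return (angles, values) with the first point repeated so the outline closes."""
    n = len(values)
    angles = [i / float(n) * 2 * pi for i in range(n)]
    return angles + angles[:1], list(values) + list(values[:1])

# Hypothetical mean scores for six technical attributes of one club
angles, closed = closed_polygon([70, 65, 72, 75, 60, 68])
print(len(angles), len(closed))
```

Both lists have one more entry than there are attributes, which is why `plot_spider` passes `angles[:-1]` to `plt.xticks` when labelling the axes.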

We compare the top team vs the bottom team in the English Premier League.

In [56]:
#Filter our targeted clubs 
df_target_clubs_liverpool = df_players_regular[df_players_regular['club'].isin(['Liverpool','Norwich City', 
                                  'FC Bayern München','SC Paderborn 07', 'FC Barcelona','RCD Espanyol'])]
#Use the Technical skill set from metadata file
l = pd.concat([df_technical_attr, pd.Series(['club'])])  # Series.append was removed in pandas 2.0
#Get groupwise mean for the teams by club
df_target_clubs_liverpool = df_target_clubs_liverpool[l]
df_target_clubs_liverpool = df_target_clubs_liverpool.groupby('club').mean().reset_index()
#df_target_clubs_liverpool
d = df_target_clubs_liverpool[df_target_clubs_liverpool['club'].isin(['Liverpool','Norwich City'])].reset_index()
plot_spider(d,df_technical_attr, 'English Premier');

Below is a similar plot comparison for Bundesliga league.

In [57]:
d = df_target_clubs_liverpool[df_target_clubs_liverpool['club'].isin(['FC Bayern München','SC Paderborn 07'])].reset_index()
plot_spider(d,df_technical_attr, 'Bundesliga');

Below is a plot comparison for the La Liga league. The gap between the highest-performing and the lowest-performing club in this league is smaller.

In [58]:
d = df_target_clubs_liverpool[df_target_clubs_liverpool['club'].isin(['FC Barcelona','RCD Espanyol'])].reset_index()
plot_spider(d,df_technical_attr, 'La Liga');

Comparing Liverpool (ranked 1 in the top league) vs Barcelona (ranked 1 in the third-ranked league)

We see that Barcelona slightly edges Liverpool in technical skills, although its league is ranked third.

In [59]:
d = df_target_clubs_liverpool[df_target_clubs_liverpool['club'].isin(['Liverpool','FC Barcelona'])].reset_index()
plot_spider(d,df_technical_attr, 'Liverpool vs Barcelona');

Building our team

We first derive a budget estimate from the mean wages of all players in our dataset: ten outfield players at the mean outfield wage plus one goal keeper at the mean goal keeper wage, which comes out to roughly 105K EUR per month.

In [60]:
# See the mean salary of all players in dataset
df_players_regular['wage_eur'].mean()
df_players_gk['wage_eur'].mean()
# calculate the mean salary for 10 outfield players + 1 goal keeper
budget = 9843*10 + 6726
budget
df_players_regular.shape
df_players_regular.boxplot(column='wage_eur', by = 'preferred_position',figsize=(13, 6))
Out[60]:
9843.492180765916
Out[60]:
6726.915520628684
Out[60]:
105156
Out[60]:
(16242, 87)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a21cf8e50>
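
The budget arithmetic in the cell above can be written as a small function. This is only a sketch: the helper name `squad_budget` is hypothetical, and the two means are the values printed above, truncated to whole euros as in the notebook cell.

```python
def squad_budget(mean_outfield_wage, mean_gk_wage, n_outfield=10, n_gk=1):
    """Wage budget for a squad paid at the dataset-wide mean wage per role."""
    return n_outfield * mean_outfield_wage + n_gk * mean_gk_wage

# Means printed above, truncated to whole euros as in the notebook cell
budget = squad_budget(9843, 6726)
print(budget)
```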

Now we take the top team in each of the three leagues and calculate their mean salary, which comes out to 117K per month. We will also check the mean overall rating and the wage distribution between positions for these teams. The top three teams have 79 players in all.

In [61]:
#Filter out top three clubs
df_topthree_clubs = df_players_regular[df_players_regular['club'].isin(['Liverpool',
                                                'FC Bayern München','FC Barcelona'])].copy()
#Use the Technical skill set from metadata file
l = pd.concat([df_technical_attr, pd.Series(['wage_eur','preferred_position','overall'])])
#Keep only the technical attributes plus wage, position and overall rating
df_topthree_clubs = df_topthree_clubs[l]
## mean wage of top three teams 117037
## mean budget we are allocated 100000 (991066)
df_topthree_clubs['wage_eur'].mean()
df_topthree_clubs['preferred_position'].value_counts();
df_topthree_clubs.boxplot(column='wage_eur', by = 'preferred_position',figsize=(13, 6));
df_topthree_clubs.overall.mean()
df_topthree_clubs
Out[61]:
117037.97468354431
Out[61]:
CM     20
CB     15
ST      7
RB      7
LB      7
CDM     6
RW      4
LM      4
LW      3
RM      2
CAM     2
CF      2
Name: preferred_position, dtype: int64
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a23fd7d50>
Out[61]:
78.49367088607595
Out[61]:
       pace  shooting  passing  dribbling  defending  physic  wage_eur preferred_position  overall
0      87.0      92.0     92.0       96.0       39.0    66.0  565000.0                 RW       94
7      77.0      60.0     70.0       71.0       90.0    86.0  200000.0                 CB       90
9      93.0      86.0     81.0       89.0       45.0    74.0  240000.0                 RW       90
19     73.0      89.0     80.0       84.0       51.0    84.0  355000.0                 ST       89
20     77.0      87.0     74.0       85.0       41.0    82.0  235000.0                 ST       89
...     ...       ...      ...        ...        ...     ...       ...                ...      ...
13104  72.0      36.0     57.0       64.0       61.0    62.0    1000.0                 RB       62
13978  61.0      56.0     60.0       66.0       48.0    53.0    4000.0                 CM       61
14637  68.0      60.0     51.0       59.0       30.0    49.0    5000.0                 ST       60
14645  66.0      36.0     57.0       60.0       57.0    58.0    5000.0                 LB       60
15347  74.0      47.0     52.0       63.0       56.0    56.0    3000.0                 LB       59

79 rows × 9 columns

Now we will get a list of players making at most 100K euros and keep those whose overall rating is at or above the 99.8th percentile of that group (an overall of 83). We end up with 78 players to choose from.
We can also compare the averages of these players, and the wages paid, with those of the top league players.

In [62]:
# keep only players whose wage is at most 100K EUR
df_our_selection = df_players_regular[(df_players_regular['wage_eur'] <=100000)].copy()

# overall rating at the 99.8th percentile of the affordable players
overall_cutoff = np.percentile(df_our_selection.overall, 99.8)
overall_cutoff
# both conditions (wage cap and rating cutoff) now hold for the selection
df_our_selection = df_our_selection[df_our_selection['overall'] >= overall_cutoff]
df_our_selection.overall.mean()
df_our_selection.boxplot(column='wage_eur', by = 'preferred_position',figsize=(13, 6))
df_our_selection.shape
Out[62]:
83.0
Out[62]:
83.57692307692308
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a24826450>
Out[62]:
(78, 87)
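
The selection rule used above (wage cap first, then a percentile cutoff on the overall rating of the affordable players) can be sketched on toy data as follows. The function name `select_affordable_top` and the five players are hypothetical, and the 50th percentile is used only to keep the toy example readable.

```python
import numpy as np

def select_affordable_top(overall, wage, wage_cap, q):
    """Indices of players under the wage cap whose rating is at or above
    the q-th percentile of ratings among those affordable players."""
    overall = np.asarray(overall, dtype=float)
    wage = np.asarray(wage, dtype=float)
    affordable = wage <= wage_cap
    cutoff = np.percentile(overall[affordable], q)
    return np.flatnonzero(affordable & (overall >= cutoff)), cutoff

# Hypothetical toy data: five players, one of them over the cap
overall = [60, 70, 80, 90, 95]
wage = [5_000, 20_000, 50_000, 90_000, 300_000]
picked, cutoff = select_affordable_top(overall, wage, wage_cap=100_000, q=50)
print(picked, cutoff)
```

Note that the percentile is computed after applying the wage cap, so the best-rated player in the toy data, who is over the cap, does not influence the cutoff.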

We will build the comparative data of "Our Team" and the "Top Three" to plot the spider chart.

In [63]:
l = pd.concat([df_technical_attr, pd.Series(['wage_eur','preferred_position','overall'])])
#Keep the same columns as for the top three clubs
df_our_selection = df_our_selection[l].copy()
df_topthree_clubs['club'] = 'Top3 Clubs'
df_our_selection['club'] = 'Our Selection'
frames = [df_topthree_clubs, df_our_selection]

result = pd.concat(frames)

# average the numeric columns only (preferred_position is text)
result = result.groupby('club').mean(numeric_only=True).reset_index()
result
Out[63]:
            club       pace   shooting    passing  dribbling  defending     physic       wage_eur    overall
0  Our Selection  72.512821  69.858974  73.910256  77.679487  61.307692  72.846154   58576.923077  83.576923
1     Top3 Clubs  73.544304  63.430380  70.911392  74.898734  63.278481  69.759494  117037.974684  78.493671

In the plot below, we see that "Our Selection" (in blue) outperforms the top three teams in the average skills of shooting, passing, dribbling and physic, and in overall rating. In defending and pace, our selection is a bit lower. Comparing the wages on a percent basis, our selection would cost 50% less.
This is just a theoretical analysis, as we assume we can sign any player in the data set.

In [64]:
#df_technical_attr
l = pd.concat([df_technical_attr, pd.Series(['overall','wage_eur'])])  # Series.append was removed in pandas 2.0
#l
result['wage_eur'] =(result.wage_eur/result.wage_eur.sum())*100
plot_spider(result,l, 'Our Selection vs Top three');
result
Out[64]:
            club       pace   shooting    passing  dribbling  defending     physic   wage_eur    overall
0  Our Selection  72.512821  69.858974  73.910256  77.679487  61.307692  72.846154  33.355327  83.576923
1     Top3 Clubs  73.544304  63.430380  70.911392  74.898734  63.278481  69.759494  66.644673  78.493671
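
The wage column is rescaled to a percent share so that it fits on the same 0 to 100 axes as the skill ratings in the spider plot. A short worked sketch of that rescaling and of the 50% saving quoted above follows. The helper name `percent_share` is hypothetical, and the two mean wages are copied from the Out[63] table.

```python
def percent_share(values):
    """Scale values so they sum to 100, as done for wage_eur before plotting."""
    total = sum(values)
    return [v / total * 100 for v in values]

# Mean wages from the Out[63] table: Our Selection, Top3 Clubs
ours, top3 = 58576.923077, 117037.974684
shares = percent_share([ours, top3])
saving = (1 - ours / top3) * 100  # how much cheaper our selection is, in percent
print([round(s, 2) for s in shares], round(saving, 1))
```

The shares reproduce the wage_eur values in the Out[64] table (about 33.36 and 66.64), and the saving is just under 50%.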

Exceptional Work

A) Classification prediction of player category

We will use the features we analyzed in the section "Explore Attributes And Class" to classify players into four categories (GK, FWD, DEF, MID).

In [65]:
df_pred_pos=df_players_cleaned.copy()
## We see some imbalance in categories of data.
position_counts = pd.DataFrame(df_pred_pos['preferred_position_cat'].value_counts())
position_counts['Percentage'] = position_counts['preferred_position_cat']/position_counts.sum()[0]
position_counts
#df_pred_pos.info() 
Out[65]:
     preferred_position_cat  Percentage
DEF                    7362    0.402779
FWD                    4588    0.251012
MID                    4292    0.234818
GK                     2036    0.111391

This is an unbalanced dataset, as the percentage distribution above and the pie chart below show. We will use the existing distribution for now.
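
We do not rebalance the classes in this lab, but one common option is to weight each class inversely to its frequency. The sketch below recomputes the shares from the counts in the table above and derives such weights with the formula n_samples / (n_classes * count), which is the heuristic behind scikit-learn's `class_weight='balanced'` setting. The helper names are hypothetical.

```python
# Class counts copied from the position_counts table
counts = {"DEF": 7362, "FWD": 4588, "MID": 4292, "GK": 2036}

def class_shares(counts):
    """Fraction of all samples that falls in each class."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def balanced_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    return {k: total / (len(counts) * v) for k, v in counts.items()}

shares = class_shares(counts)
weights = balanced_weights(counts)
for name in counts:
    print(name, round(shares[name], 6), round(weights[name], 3))
```

With these weights the rare GK class counts more per sample than the common DEF class, so each class contributes equally in total.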

In [66]:
plt.figure(figsize=(4,4))
plt.pie(position_counts['Percentage'],
       labels = ['DEF', 'FWD', 'MID', 'GK']);
In [67]:
from sklearn.model_selection import ShuffleSplit

# we want to predict the X and y data as follows:
# all mental, attacking, skill, movement, power and defending attributes,
# plus height and weight, for all players
l = pd.concat([df_mental_attr, df_attacking_attr, df_skill_attr, df_movement_attr,
               df_power_attr, df_defending_attr, pd.Series(['height_cm','weight_kg'])])

#l=l.append(pd.Series(['wage_eur','overall']))
#l

y = df_pred_pos.preferred_position_cat.values # get the labels we want
X = df_pred_pos[l].values # use everything else to predict!

    ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
    #    have converted them into simple matrices to use with scikit learn
        
# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,test_size  = 0.2)
                         
print(cv_object)
ShuffleSplit(n_splits=3, random_state=None, test_size=0.2, train_size=None)

We will fit a logistic regression on three random 80/20 shuffle splits (ShuffleSplit draws each split independently, so this is not three-fold cross validation) and check the confusion matrix and the accuracy score for each run.
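
To make the splitting scheme concrete, here is a rough sketch of what each shuffle split does. It is not scikit-learn's implementation: the function name `shuffle_split` and the seed are assumptions, and the sample count 18278 is the total from the position counts above.

```python
import math
import random

def shuffle_split(n_samples, n_splits=3, test_size=0.2, seed=0):
    """Yield (train_indices, test_indices) for independent random splits."""
    rng = random.Random(seed)
    n_test = math.ceil(n_samples * test_size)  # test share rounded up
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)                   # fresh permutation for every split
        yield indices[n_test:], indices[:n_test]

splits = list(shuffle_split(n_samples=18278))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Because every split is drawn from a fresh permutation, the same player can appear in the test set of more than one iteration, which is the main difference from k-fold cross validation.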

In [68]:
# run logistic regression and vary some parameters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import classification_report
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None, solver='liblinear' ) # get object
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
    clf =lr_clf.fit(X[train_indices],y[train_indices])  # train object
    y_hat = lr_clf.predict(X[test_indices]) # get test set predictions
    
    # print the accuracy and confusion matrix 
    print("====Iteration",iter_num," ====")
    print("accuracy", mt.accuracy_score(y[test_indices],y_hat)) 
    print("confusion matrix\n",mt.confusion_matrix(y[test_indices],y_hat))
    ConfusionMatrixDisplay.from_estimator(clf, X[test_indices],y[test_indices],cmap=plt.cm.Blues,values_format='d')
    plt.grid(False);  # turn the grid off explicitly
    print(classification_report(y[test_indices], y_hat, target_names=['DEF', 'FWD', 'GK','MD']))
====Iteration 0  ====
accuracy 0.8525711159737418
confusion matrix
 [[1432    0    0   52]
 [   3  732    0  188]
 [   0    0  434    0]
 [ 120  176    0  519]]
Out[68]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1a23417a90>
              precision    recall  f1-score   support

         DEF       0.92      0.96      0.94      1484
         FWD       0.81      0.79      0.80       923
          GK       1.00      1.00      1.00       434
          MD       0.68      0.64      0.66       815

    accuracy                           0.85      3656
   macro avg       0.85      0.85      0.85      3656
weighted avg       0.85      0.85      0.85      3656

====Iteration 1  ====
accuracy 0.8498358862144421
confusion matrix
 [[1407    0    0   58]
 [   9  727    0  192]
 [   0    0  400    0]
 [ 121  169    0  573]]
Out[68]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1a28696150>
              precision    recall  f1-score   support

         DEF       0.92      0.96      0.94      1465
         FWD       0.81      0.78      0.80       928
          GK       1.00      1.00      1.00       400
          MD       0.70      0.66      0.68       863

    accuracy                           0.85      3656
   macro avg       0.86      0.85      0.85      3656
weighted avg       0.85      0.85      0.85      3656

====Iteration 2  ====
accuracy 0.8388949671772429
confusion matrix
 [[1369    4    0   65]
 [   5  722    0  197]
 [   0    0  412    0]
 [ 115  203    0  564]]
Out[68]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1a28531fd0>
              precision    recall  f1-score   support

         DEF       0.92      0.95      0.94      1438
         FWD       0.78      0.78      0.78       924
          GK       1.00      1.00      1.00       412
          MD       0.68      0.64      0.66       882

    accuracy                           0.84      3656
   macro avg       0.84      0.84      0.84      3656
weighted avg       0.84      0.84      0.84      3656

We were able to get a pretty good F1 score, which is the harmonic mean of precision and recall (sensitivity), for goal keepers and defenders (1.00 and 0.94). For the forward position, the F1 score was reasonable at 0.78 to 0.80, but it was a bit low for midfielders at 0.66 to 0.68. The confusion matrices show that the largest share of errors is between midfielders and forwards, whose skills and physical attributes overlap.
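
As a check on how the report values follow from the confusion matrix, the sketch below recomputes precision, recall and F1 per class from the matrix printed for iteration 0 (rows are true labels, columns are predicted labels). The helper name `per_class_scores` is hypothetical.

```python
def per_class_scores(cm, labels):
    """Precision, recall and F1 per class from a confusion matrix
    whose rows are true labels and columns are predicted labels."""
    scores = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum: all predicted as this class
        actual = sum(cm[i])                    # row sum: all truly in this class
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        scores[label] = (precision, recall, f1)
    return scores

# Confusion matrix printed for iteration 0
cm = [[1432, 0, 0, 52],
      [3, 732, 0, 188],
      [0, 0, 434, 0],
      [120, 176, 0, 519]]
labels = ["DEF", "FWD", "GK", "MD"]
scores = per_class_scores(cm, labels)
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
for label, (p, r, f1) in scores.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2))
print("accuracy", accuracy)
```

The midfielder row shows where the low score comes from: 176 midfielders are predicted as forwards and 120 as defenders, while 188 forwards are predicted as midfielders.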

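The weights below are hard to interpret directly. The loop prints the coefficients of the first class only (DEF, in the one-vs-rest model fitted by liblinear), and the features were not standardized before fitting, so the magnitudes are not directly comparable.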

In [69]:
weights = lr_clf.coef_.T # transpose to shape (n_features, n_classes)
variable_names = l
for coef, name in zip(weights,variable_names):
    print(name, 'has weight of', coef[0]) # coef[0] is the weight for the first class (DEF)
mentality_aggression has weight of -0.008417359763180957
mentality_interceptions has weight of 0.11245938863103466
mentality_positioning has weight of -0.054792691851818155
mentality_vision has weight of -0.10026690879939479
mentality_penalties has weight of -0.0008705976031243018
mentality_composure has weight of 0.0064435018722904655
attacking_crossing has weight of 0.03725050350162565
attacking_finishing has weight of -0.05384427538070669
attacking_heading_accuracy has weight of 0.009375005198822242
attacking_short_passing has weight of -0.09332250416406027
attacking_volleys has weight of 0.00477809960699241
skill_dribbling has weight of -0.0386197083841056
skill_curve has weight of 0.019985994518187614
skill_fk_accuracy has weight of 0.0011489294278763976
skill_long_passing has weight of -0.052552513466437134
skill_ball_control has weight of -0.02473687693794122
movement_acceleration has weight of -0.0033011795352923395
movement_sprint_speed has weight of 0.026274358849179034
movement_agility has weight of 0.0032633825027011476
movement_reactions has weight of -0.06452129220797811
movement_balance has weight of -0.021041038274015024
power_shot_power has weight of -0.010881496532314228
power_jumping has weight of 0.004975810605998517
power_stamina has weight of -0.005409340433701218
power_strength has weight of 0.014225017659079253
power_long_shots has weight of -0.01701158665581218
defending_marking has weight of 0.13137511611953973
defending_standing_tackle has weight of 0.10721895273793296
defending_sliding_tackle has weight of 0.10649176943920918
height_cm has weight of 0.00659541361633567
weight_kg has weight of -0.03971127849226783
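
The fitted `coef_` matrix has one row per class, so the listing above covers only one of the four classes. A small sketch of how the weights could be listed for every class, ranked by absolute size, is given below. The function name `weights_by_class` and the toy coefficient matrix are hypothetical and are not results from our model.

```python
def weights_by_class(coef, classes, feature_names):
    """Map each class to its (feature, weight) pairs, largest absolute weight first."""
    table = {}
    for cls, row in zip(classes, coef):
        ranked = sorted(zip(feature_names, row),
                        key=lambda item: abs(item[1]), reverse=True)
        table[cls] = ranked
    return table

# Hypothetical 2-class, 3-feature coefficient matrix (rows = classes)
coef = [[0.13, -0.05, 0.01],
        [-0.10, 0.09, 0.02]]
features = ["defending_marking", "attacking_finishing", "height_cm"]
table = weights_by_class(coef, ["DEF", "FWD"], features)
for cls, ranked in table.items():
    print(cls, ranked)
```

With the real model, the rows of `lr_clf.coef_` follow the order of `lr_clf.classes_`, and a ranking like this is only meaningful if the features are on comparable scales.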

B) Principal Component Analysis

In this section, we will use Principal Component Analysis (PCA), an unsupervised linear transformation technique for dimensionality reduction, to study the relationships between 4 main positions: Defense players (DEF), Mid Field players (MID), Forward players (FWD), Goal Keepers (GK). It is a useful technique when we have a large number of correlated features in the dataset. It allows us to summarize the information with a smaller number of collectively representative variables.
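
For readers who want to see what the transformation does, the following is a minimal sketch of PCA through a singular value decomposition of the mean-centered data. It is an illustration on made-up data, not the code used in this lab: the function name `pca_svd` and the toy matrix are assumptions, and the signs of the components may differ from what scikit-learn returns.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """Project X onto its first principal components using an SVD of the centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # PCA works on mean-centered data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]             # rows are unit-length directions
    explained = (S ** 2) / np.sum(S ** 2)      # share of total variance per component
    return Xc @ components.T, components, explained[:n_components]

# Hypothetical toy data: three columns for six players, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 1))
X = np.hstack([base, 2 * base + 0.1 * rng.normal(size=(6, 1)), rng.normal(size=(6, 1))])
scores, components, explained = pca_svd(X)
print(scores.shape, explained.round(3))
```

The two correlated columns are largely captured by a single component, which is the effect we rely on when summarizing the correlated skill features.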

We now create a new dataset with all the skill features.

In [70]:
df_players_pca1 = df_players[['preferred_position_cat','pace', 'shooting', 'passing', 'dribbling',
       'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking',
       'gk_reflexes', 'gk_speed', 'gk_positioning', 'attacking_crossing',
       'attacking_finishing', 'attacking_heading_accuracy',
       'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
       'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
       'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
       'movement_agility', 'movement_reactions', 'movement_balance',
       'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
       'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
       'mentality_positioning', 'mentality_vision', 'mentality_penalties',
       'mentality_composure', 'defending_marking', 'defending_standing_tackle',
       'defending_sliding_tackle']]

position_pca_withGK = df_players_pca1.preferred_position_cat

Let's find the first two principal components of this dataset. We will print the components of the PCA.

In [71]:
from sklearn.decomposition import PCA

# drop the label column without modifying df_players_pca1 in place
df_players_pca2 = df_players_pca1.drop(columns=["preferred_position_cat"])
X = df_players_pca2.values

pca = PCA(n_components=2)
X_pca = pca.fit(X).transform(X) # fit data and then transform it

# print the components

print ('pca:', pca.components_)
pca: [[-2.32492063e-01 -2.04407504e-01 -2.18573035e-01 -2.32181428e-01
  -1.78989819e-01 -2.18866661e-01  2.08128932e-01  2.00916876e-01
   1.96664865e-01  2.11202126e-01  1.19457098e-01  2.01590529e-01
  -1.69547397e-01 -1.58722617e-01 -1.44454520e-01 -1.44784518e-01
  -1.47010095e-01 -1.87947122e-01 -1.62778974e-01 -1.43168489e-01
  -1.31620816e-01 -1.71997265e-01 -1.13962457e-01 -1.12299038e-01
  -1.10992490e-01 -3.67125741e-02 -9.44414756e-02 -7.82459709e-02
  -3.36644561e-02 -1.42007111e-01 -2.11755314e-02 -1.69872949e-01
  -1.27409866e-01 -1.26418824e-01 -1.80067529e-01 -9.51646863e-02
  -1.29486021e-01 -8.48367265e-02 -1.31303140e-01 -1.35387220e-01
  -1.26164384e-01]
 [-6.42389242e-02 -1.81666444e-01 -2.44346766e-02 -7.91296110e-02
   3.23520955e-01  1.36365705e-01 -4.69047116e-02 -4.51457205e-02
  -4.43547110e-02 -4.76496257e-02 -2.84740854e-02 -4.55552731e-02
  -5.89005946e-02 -2.44705775e-01  9.88129394e-02 -1.23829272e-04
  -1.85516571e-01 -1.20858007e-01 -1.32109794e-01 -1.05708429e-01
   3.03538255e-02 -6.00727334e-02 -1.00412225e-01 -8.66236540e-02
  -1.20056086e-01 -4.03511550e-03 -9.07656568e-02 -1.27766292e-01
   5.51185609e-02  5.23480043e-02  1.04888246e-01 -1.73761598e-01
   1.79061885e-01  3.35185785e-01 -1.78228352e-01 -1.17640324e-01
  -1.41076997e-01 -1.62294972e-02  3.09966203e-01  3.57800581e-01
   3.58132202e-01]]
In [72]:
# this function definition just formats the weights into readable strings
def get_feature_names_from_weights(weights, names):
    tmp_array = []
    for comp in weights:
        tmp_string = ''
        for fidx,f in enumerate(names):
            if fidx>0 and comp[fidx]>=0:
                tmp_string+='+'
            tmp_string += '%.2f*%s ' % (comp[fidx],f) # use the full feature name (f[:-5] cut off its last 5 characters)
        tmp_array.append(tmp_string)
    return tmp_array

feature_names = ['pace', 'shooting', 'passing', 'dribbling',
       'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking',
       'gk_reflexes', 'gk_speed', 'gk_positioning', 'attacking_crossing',
       'attacking_finishing', 'attacking_heading_accuracy',
       'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
       'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
       'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
       'movement_agility', 'movement_reactions', 'movement_balance',
       'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
       'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
       'mentality_positioning', 'mentality_vision', 'mentality_penalties',
       'mentality_composure', 'defending_marking', 'defending_standing_tackle',
       'defending_sliding_tackle']
  
# format the component weights as readable strings
pca_weight_strings = get_feature_names_from_weights(pca.components_, feature_names) 

# create a pandas dataframe from the transformed outputs
df_pca1 = pd.DataFrame(X_pca,columns=pca_weight_strings)

# use .values so the labels are assigned by position, not by index
df_pca1['preferred_position_cat']=position_pca_withGK.values
In [73]:
#sns.regplot(x=df_pca.iloc[:,0], y=df_pca.iloc[:,1])
plt.figure(figsize=(10,10));

ax = sns.scatterplot(x=df_pca1.iloc[:,0], y=df_pca1.iloc[:,1],hue=df_pca1.iloc[:,2]);
ax.set(xlabel="PCA1", ylabel = "PCA2", title = "Principal Component Analysis with all player positions");

Along the x-axis (PCA1), the first component separates Goal Keepers (GK) from players in other positions, giving two clusters. Along the y-axis (PCA2), the second component divides the outfield cluster into 3 groups (shown in 3 colors, green, red and blue) for the 3 positions: Defense players (DEF), Mid Field players (MID) and Forward players (FWD). This makes sense because GK is very different from the other positions.

From the visualization, we guess that the first component is dominated by players from the other positions rather than by the GK position. We will exclude the GK position in the next step and then repeat the same study to verify this guess.

We now create a new dataframe without the GK skills and then drop the GK position.

In [74]:
df_players_pca3 = df_players[['preferred_position_cat','pace', 'shooting', 'passing', 'dribbling',
       'defending', 'physic',  'attacking_crossing',
       'attacking_finishing', 'attacking_heading_accuracy',
       'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
       'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
       'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
       'movement_agility', 'movement_reactions', 'movement_balance',
       'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
       'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
       'mentality_positioning', 'mentality_vision', 'mentality_penalties',
       'mentality_composure', 'defending_marking', 'defending_standing_tackle',
       'defending_sliding_tackle']]
In [75]:
df_players_pca4 = df_players_pca3[df_players_pca3['preferred_position_cat'] != "GK"].copy()
df_players_pca4 = df_players_pca4.reset_index(drop=True)
position_pca_noGK = df_players_pca4.preferred_position_cat

The first two principal components of the dataset without the GK position are as follows.

In [76]:
# drop the label column without modifying df_players_pca4 in place
df_players_pca5 = df_players_pca4.drop(columns=["preferred_position_cat"])
X = df_players_pca5.values

pca = PCA(n_components=2)
X_pca = pca.fit(X).transform(X) # fit data and then transform it

# print the components

print ('pca:', pca.components_)
pca: [[ 0.11905409  0.25187068  0.12495182  0.16395475 -0.20252757 -0.04995967
   0.15217403  0.29121778 -0.04076931  0.08092266  0.24132602  0.19670097
   0.21761527  0.18628804  0.06284644  0.13717563  0.12699335  0.11274799
   0.1579853   0.05346836  0.11891073  0.1830293  -0.03092902  0.02352381
  -0.0743232   0.2534202  -0.08564725 -0.21387233  0.2454299   0.18244145
   0.18326464  0.07994962 -0.19839821 -0.23861578 -0.24499156]
 [-0.0030739   0.07367944  0.17777238  0.10660437  0.30022712  0.13672074
   0.18676057  0.02463051  0.09548724  0.17077771  0.07099597  0.10893516
   0.16451667  0.16675395  0.22692024  0.13198567 -0.00308745 -0.00316013
   0.04037042  0.14840369  0.02064369  0.12738459  0.06275997  0.15369475
   0.09615907  0.13721417  0.2352215   0.3447918   0.08217598  0.1504521
   0.04201254  0.16068657  0.30385719  0.33012182  0.31598788]]
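The raw weight matrix is hard to read by eye. One way to interpret it is to list the features with the largest absolute weight in each component. The sketch below is a minimal illustration and not part of the original analysis: `top_loadings` is a hypothetical helper, and the toy weights and four feature names are made-up stand-ins rather than the output printed above.

```python
import numpy as np

def top_loadings(components, feature_names, k=3):
    """Return, for each component, the k features with the largest absolute weight."""
    components = np.asarray(components)
    ranked = []
    for weights in components:
        # sort feature indices by descending absolute weight and keep the first k
        order = np.argsort(-np.abs(weights))[:k]
        ranked.append([(feature_names[i], round(float(weights[i]), 3)) for i in order])
    return ranked

# toy weights (illustrative only): 2 components x 4 features
toy_components = np.array([[0.12, 0.29, -0.24, 0.05],
                           [0.00, 0.02, 0.33, 0.30]])
toy_features = ['pace', 'attacking_finishing',
                'defending_standing_tackle', 'defending']

for i, pairs in enumerate(top_loadings(toy_components, toy_features, k=2), start=1):
    print(f"PC{i}:", pairs)
```

In the notebook, the same helper could presumably be called with `pca.components_` and the list of skill column names. The sign of a weight matters: features with opposite signs pull a player toward opposite ends of that component.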
In [77]:
feature_names = ['pace', 'shooting', 'passing', 'dribbling',
       'defending', 'physic',  'attacking_crossing',
       'attacking_finishing', 'attacking_heading_accuracy',
       'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
       'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
       'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
       'movement_agility', 'movement_reactions', 'movement_balance',
       'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
       'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
       'mentality_positioning', 'mentality_vision', 'mentality_penalties',
       'mentality_composure', 'defending_marking', 'defending_standing_tackle',
       'defending_sliding_tackle']
  
# label each component using its feature weights
pca_weight_strings = get_feature_names_from_weights(pca.components_, feature_names) 

# create some pandas dataframes from the transformed outputs
df_pca2 = pd.DataFrame(X_pca,columns=[pca_weight_strings])
df_pca2['preferred_position_cat']=position_pca_noGK
In [78]:
#sns.regplot(x=df_pca.iloc[:,0], y=df_pca.iloc[:,1])
plt.figure(figsize=(10,10));

ax = sns.scatterplot(x=df_pca2.iloc[:,0], y=df_pca2.iloc[:,1],hue=df_pca2.iloc[:,2]);
ax.set(xlabel="PCA1", ylabel = "PCA2", title = "Principal Component Analysis without GK position");

Consistent with our initial prediction, the first component is dominated by the DEF, MID and FWD positions. In this second analysis without the GK position, the plot above shows these three positions in a single cluster, but that cluster still splits into three regions corresponding to DEF, MID and FWD. We see some overlap between MID and FWD.
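The overlap seen in the plot can also be given a rough number. One simple option, sketched below, is to compare the distance between the position centroids in the two-component space. This is an assumption-laden illustration and not the method used in this lab: `centroid_distances` is a hypothetical helper, and the scores are synthetic points generated to mimic three groups, not the FIFA data.

```python
import numpy as np

def centroid_distances(scores, labels):
    """Return per-group centroids and pairwise Euclidean distances between them."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    # mean position of each group in the projected space
    centroids = {g: scores[labels == g].mean(axis=0) for g in groups}
    dists = {}
    for i, a in enumerate(groups):
        for b in groups[i + 1:]:
            dists[(a, b)] = float(np.linalg.norm(centroids[a] - centroids[b]))
    return centroids, dists

# synthetic 2-D scores (illustrative only): MID and FWD are placed close together
rng = np.random.default_rng(0)
toy_scores = np.vstack([
    rng.normal(loc=(-3.0, 2.0), scale=1.0, size=(50, 2)),   # stand-in for DEF
    rng.normal(loc=(0.0, 0.0), scale=1.0, size=(50, 2)),    # stand-in for MID
    rng.normal(loc=(1.5, -1.0), scale=1.0, size=(50, 2)),   # stand-in for FWD
])
toy_labels = np.array(['DEF'] * 50 + ['MID'] * 50 + ['FWD'] * 50)

centroids, dists = centroid_distances(toy_scores, toy_labels)
for pair, d in dists.items():
    print(pair, round(d, 2))
```

A smaller centroid distance suggests more overlap, but the measure ignores how spread out each group is, so it should be read alongside the scatter plot rather than in place of it.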